I thought about this topic for a while and even considered writing about bugs found in SQL Server itself as the horrifying subject. But since horrifying situations and setups also qualify, I thought I’d throw out a story from some consulting I’ve done in the past.
When it comes to being a consultant, you find horror, and then you find the painful reminder of real horror. Now, before I go on, there is one thing about being a consultant I would never give up: you are exposed to horror. That exposure equates to experience in fixing these situations in timely and efficient ways, and it can be very difficult to gain otherwise. Let’s face it, even knowing the common view that an IT career consists of job hopping every year, no one really likes doing that very much. My first job in IT lasted almost a decade. It included several promotions into different positions within the same company, but I was still exposed to the same infrastructure throughout that time. So consulting gives you a great return in the respect of the horrors you have the chance to work on.
I spent some time thinking back, laughing, getting that nasty feeling in my gut, and even having a nightmare or two about all the really bad things I’ve had to either fix or reengineer. One really comes to mind and is what I wanted to write about for this post.
I’ve always been an advocate of log shipping for a solid disaster recovery plan. It’s cheap on both resources and maintenance, and Enterprise Edition is not required. This makes log shipping a truly effective choice in the DR world of a small to mid-size business running SQL Server. There is one step in log shipping, though, that is absolutely critical: planning. A client I helped out years ago thought the same about log shipping and went full steam ahead with putting it in place to secure the data in case of a disaster. This client was slightly different from others in that the owner of the business was a friend. So when something was wrong, going wrong, or just didn’t smell right, the owner gave me a call. One night, that call came in because the network they were on was bouncing like a 5 year old with a ping pong ball. I was asked what the risk was if the network dropped long enough to interrupt the connection to the off-site location the logs were being shipped to. Essentially, there wasn’t much risk beyond losing the last transaction log backup taken. Given the nature of the business, the systems had already been halted to stop users connecting. That was a good decision, and you’ll see why soon.
My first recommendation was to simply keep a very close watch on the log shipping jobs that copied the TRN backups to the off-site location. If those jobs failed more than once and the network remained down for a lengthy time, it might be time to bring the warm standby instance up as the primary. This would retain business functions the next morning (it was already the early AM at this point). Between the owner, the acting DBA, and me, the decision was made to do just that: fail over. Interestingly enough, after all the failover procedures were completed, the database on the off-site instance was still inaccessible. Basically, it was still in standby mode. At this point, the DBA was sweating it and asked me to actually look at the instance. What I found was expected: a database in standby. What I didn’t find was where the horror lay. There wasn’t a job on the instance restoring any logs from a log shipping setup. Where was it?
Turns out, after the network came back up and I was able to look at the primary source, I found that the production instance set up for log shipping was shipping to a development instance. After calming the DBA down and assuring him the company hadn’t lost anything (other than time), we went over the planning steps he had followed when setting up log shipping.
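A quick sanity check would have caught this long before disaster night. Log shipping records its configuration in standard msdb tables on the primary, so a query like this (a sketch; run it against your own primary and compare the secondary names to your DR documentation) shows exactly where each database is actually shipping:

```sql
-- Run on the primary server: list every log shipping primary database
-- and the secondary server/database it is actually shipping to.
SELECT  pd.primary_database,
        ps.secondary_server,
        ps.secondary_database
FROM    msdb.dbo.log_shipping_primary_databases AS pd
JOIN    msdb.dbo.log_shipping_primary_secondaries AS ps
        ON pd.primary_id = ps.primary_id;
```

If `secondary_server` comes back as a development box instead of the DR site, you’ve found this exact horror before a disaster finds it for you.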
1) Log shipping was chosen to be used for DR
2) Log shipping was setup on production
3) Instead of testing in development from the start, the DBA decided to just ship the data to development as a test. That’s good enough, right?
4) Something happened. Maybe a football game that night?! The DBA never changed anything because log shipping was working.
5) Disaster strikes and, as we all know, in many installations things aren’t tested until there is a disaster.
In the long run, the setup was fixed and tested the following week with great success. However, consider if the planning steps had been laid out as follows:
1) Log shipping was chosen to be used for DR
2) Log shipping was setup between two development instances
3) Log shipping was monitored in development for a length of time to establish a baseline estimate for production
4) Development failover tests were made
5) Implementation plans set and scripted for moving to test (UAT, QA, etc.)
6) Implementation followed through and failover tested again in test.
7) Implementation plans set for production
8) Implemented and failover tested
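The failover tests in the steps above come down to a small amount of T-SQL on the secondary: apply any remaining log backups, then recover the database to bring it online as the new primary. A minimal sketch (the database name and backup path here are hypothetical; substitute your own):

```sql
-- On the secondary: apply the last copied transaction log backup,
-- leaving the database in a restoring state in case more logs follow...
RESTORE LOG SalesDB
FROM DISK = N'\\drsite\logship\SalesDB_last.trn'
WITH NORECOVERY;

-- ...then recover the database to bring it online and writable.
RESTORE DATABASE SalesDB WITH RECOVERY;
```

Scripting this during the development round means that on disaster night you are running a rehearsed script, not improvising, and it is exactly this step that would have exposed the missing restore job on the real DR server.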
Really, let’s look at this. There is more time and planning involved in going from development to test to production. However, how does that compare to the time, money, and loss that come from the horrific discovery that your life-saving DR plan fails over only to a local development instance? Your call, but I know where my money is.
I hope you enjoyed this story. It was a good one, and one in which I was able to help mentor a DBA who turned into a highly skilled DBA whom I respect. The DBA made a mistake but, beyond that, accepted it, took the stress, and worked through it to make the situation better. If we all follow those simple steps in the event of a disaster and also work to better ourselves with well-formed planning, we’ll all come out on top in the end.