What Can Go Wrong Will Go Wrong

Over the weekend I had some planned maintenance for YackTrack. If you read that post, you noticed that I had problems and updated the post a few times. I had a very detailed plan of what I needed to do, which I followed to the letter. However, regardless of how much testing I did, the maintenance did not go as planned. In particular, some of the SQL updates I was running failed when applied to the application environment. Thankfully, I am comfortable with databases and was able to work around the issues, but it made me think about what could have gone wrong and what you need to do to avoid these problems.

  1. If you are working with code or a database, back up everything before starting. I try to take regular backups, and I did have a database backup prior to starting the maintenance. This also gives you a rollback point in case something catastrophic occurs.
  2. Have a detailed plan of everything you are doing. I had several steps that I needed to complete. Some of the steps were code related and others were database related. You need to know whether some of the steps can be run in a different order in case something fails. During the maintenance, there was one database script that was particularly problematic. Thankfully, none of the other scripts were dependent upon its results, so I could run them while I figured out what was wrong with the problem script.
  3. Make sure the plan takes into account failure points. Where could the plan fail? Where are the logical separations between tasks? Which tasks are the most likely to fail? If you think through the worst-case scenario, you can plan for it.
  4. Make sure you have a rollback or restore plan. What happens if you cannot recover from the failures? Hopefully, you have the previous version of the code and a database backup from point #1 above. They can now be used to restore the application and the database to their prior state. Also, keep the rollback plan as simple as possible; you do not want your rollback plan to fail. A minimal sketch of this backup-and-restore safety net appears right after this list.
  5. Test everything in your plan. Run every task in your plan to ensure that it works correctly.
  6. Test everything in your plan again. Yes, test everything at least twice. More importantly, test your plan in the simplest possible way in the previous step, and then do performance testing in this step. This ensures that a database update will not run significantly longer than expected and that no other lengthy delays will occur.
  7. During the maintenance, do not get distracted. This did not happen this weekend, but I have seen it occur before. Typically, when people are updating an application there is a list of tasks that need to be completed. Sometimes the focus on those tasks is not as complete as it needs to be and things get skipped, or in the rush to finish you miss a detail that is very important. A balance between focus and the desire to finish is needed so that mistakes are not made.
  8. Celebrate, but not too early. When I finally completed the maintenance, I “celebrated”. I did not have a party, but I did breathe a big sigh of relief. What I did not realize was that a section of untested code in my Identi.ca support was causing a defect in both the URL and Chatter searches. I caught the URL search defect early, but did not realize until the next day that Chatter searches had the same problem.
  9. Do not be afraid to admit defeat or mistakes. When I saw the defect in the Identi.ca support, I quickly pulled it. There was no need to have a defect tarnishing the reputation of the application when it is very simple to pull support for a service. The mistake I made was not testing that problematic section of code. I do not have 100% test code coverage, and do not plan to. However, missing this section caused problems, and now I need to go back and start over with Identi.ca.
  10. Learn from the problems and mistakes. I learned that I need to do two things at this point. First, I will be doing more “active” database performance management. By “active” I mean regularly checking the performance of various queries to spot possible issues; a sketch of that kind of check also follows this list. Second, I will be increasing my test code coverage. Obviously, my tests are not adequate if they let in the type of defect that caused the Identi.ca problems.
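
For points 1 and 4, here is a minimal sketch of what an automated backup-and-rollback safety net could look like. This post does not describe YackTrack's actual stack or tooling, so the sketch assumes a MySQL database reachable through the standard mysqldump and mysql command-line tools; the database name, user, and file names are placeholders, not real configuration.

```python
#!/usr/bin/env python3
"""Hypothetical backup-and-restore safety net for a maintenance window.

Assumes a MySQL database and the mysqldump/mysql command-line tools;
the database name, user, and file names are placeholders.
"""
import subprocess
import sys
from datetime import datetime

DB_NAME = "yacktrack"          # placeholder database name
DB_USER = "maintenance_user"   # placeholder user with dump/restore rights
BACKUP_FILE = f"backup_{datetime.now():%Y%m%d_%H%M%S}.sql"


def backup() -> None:
    """Dump the whole database to a timestamped file (the rollback point from point 1)."""
    with open(BACKUP_FILE, "w") as out:
        # -p makes mysqldump prompt for the password interactively
        subprocess.run(["mysqldump", "-u", DB_USER, "-p", DB_NAME],
                       stdout=out, check=True)


def restore() -> None:
    """Reload the backup file (the simple rollback plan from point 4)."""
    with open(BACKUP_FILE) as dump:
        subprocess.run(["mysql", "-u", DB_USER, "-p", DB_NAME],
                       stdin=dump, check=True)


def run_maintenance() -> None:
    """Placeholder for the actual maintenance scripts."""
    raise NotImplementedError("apply your SQL and code updates here")


if __name__ == "__main__":
    backup()  # never start without a fresh rollback point
    try:
        run_maintenance()
    except Exception as exc:  # any failure triggers the rollback plan
        print(f"Maintenance failed ({exc}); restoring from {BACKUP_FILE}",
              file=sys.stderr)
        restore()
        sys.exit(1)
```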
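
And for the “active” query checking in point 10, a rough sketch of the idea: time a known set of queries on a schedule and flag any that run longer than expected. The example uses sqlite3 only so it is self-contained and runnable; the table, queries, and threshold are hypothetical, not YackTrack's real schema.

```python
#!/usr/bin/env python3
"""Sketch of "active" query performance checking: time a known set of
queries and flag any that run longer than expected.

Uses sqlite3 only so the example is self-contained; the table, queries,
and threshold below are hypothetical.
"""
import sqlite3
import time

SLOW_THRESHOLD_SECONDS = 0.5  # arbitrary; tune to what "expected" means for you

# Queries worth watching, labeled for the report (made-up examples).
WATCHED_QUERIES = {
    "recent_mentions": "SELECT * FROM mentions ORDER BY created_at DESC LIMIT 50",
    "mention_counts": "SELECT service, COUNT(*) FROM mentions GROUP BY service",
}


def check_queries(conn: sqlite3.Connection) -> None:
    """Run each watched query, time it, and print a SLOW/ok report line."""
    for label, sql in WATCHED_QUERIES.items():
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        status = "SLOW" if elapsed > SLOW_THRESHOLD_SECONDS else "ok"
        print(f"{label:16s} {elapsed:8.3f}s  {status}")


if __name__ == "__main__":
    # Throwaway in-memory schema so the sketch runs end to end.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (service TEXT, created_at TEXT)")
    conn.executemany("INSERT INTO mentions VALUES (?, ?)",
                     [("twitter", "2009-01-01"), ("friendfeed", "2009-01-02")])
    check_queries(conn)
```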

This is my list based on my experiences from this weekend. I have done this kind of maintenance several times before, so I am definitely disappointed that these problems happened. Is there something I am missing? Do you have some interesting tips or tricks?

5 thoughts on “What Can Go Wrong Will Go Wrong”

  1. It’s a good list, Rob. After years working within infrastructure, design, and management, I have generally become very good at this kind of checklist.

    Having escape routes is the primary concern for me. You always want to be able to go back to what you had before in the worst-case scenario.

    I also try to combine other updates to allow minimal downtime on products/services.

    Equally, some change management: think about what other services might be affected by taking something offline, and whether you need to notify anyone else.

    Finally, communicate what you are doing to those who might use your services. Keep them informed about the benefits of your work and when they can expect the service to return (and always add some contingency); it keeps everyone happy.

    Doing work like this is always a good learning experience; I’ve always learnt the most from my mistakes!

  2. Ed,
    This was mostly written with the small startup in mind, though I did not make that clear. If we are talking enterprise, then all of the change management and communication items become much more important. For the enterprise, I would have a much bigger and different list 🙂
