Is There Ever a “Right” Time for an Outage?

Like many of you, I regularly read the tech press, and one thing that always interests me is outages. As a Fellow here at Acronis, I am especially interested in outages that could easily have been avoided or rapidly remediated. Most people in the tech industry were aware of the recent Amazon outage, but I was much more interested in the 18-minute Google Mail outage that happened two weeks before Christmas. To make a long story short, Google rolled out a routine load balancer update. There are fail-safes and monitors, but stuff happens, and the sequence was: 0845 PT apply the patch, 0906 see the problem, 0913 revert the update, 0916 all back to normal.

How could Google do this so easily? And why would Google do an update during such a peak time?

Around the same time, I also happened to read a blog post about how many companies do ALL of their patching (firmware, OS, applications) in one go, usually during the Christmas break.

Obviously, people test their patches to the best of their ability. But the dependencies between virtual and physical systems, between one application and another, and between one OS and another (or even between versions of the same OS), plus hypervisor interactions and versioning, all combine to the point that the unintended consequences of those patches can keep you down and limping for days. And if you back everything out, you may be prevented from ever moving ahead.

I reject the premise that backing up your virtual machines and hypervisors is enough. You need to back up and test recovery of your entire environment, including standalone physical servers. But I further reject the idea of big-bang patches. I think it’s best to learn how to roll out incremental changes whenever they are ready. 

Don’t think of patches as big ugly things to be afraid of. Think of big, ugly patches as the old way of life, and learn to deploy small patches whenever they can improve your company’s business or your infrastructure.
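
To make the idea of small, revertible changes concrete, here is a minimal sketch in Python. It is not how Google or any particular tool does it. The Patch class, the roll_out function, and the is_healthy check are all hypothetical names invented for illustration, and a real health check would query your monitoring rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Patch:
    """A hypothetical small change that knows how to undo itself."""
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


def roll_out(patches: List[Patch], is_healthy: Callable[[], bool]) -> List[str]:
    """Apply patches one at a time and stop at the first one that breaks health.

    Returns the names of the patches that stayed applied.
    """
    applied = []
    for patch in patches:
        patch.apply()
        if not is_healthy():
            # Only the most recent small change has to be backed out.
            patch.revert()
            break
        applied.append(patch.name)
    return applied


if __name__ == "__main__":
    # Toy "infrastructure": version numbers stand in for real systems.
    state = {"lb_version": 1, "os_version": 1}

    def set_key(key: str, value: int) -> Callable[[], None]:
        return lambda: state.__setitem__(key, value)

    patches = [
        Patch("os-update", set_key("os_version", 2), set_key("os_version", 1)),
        Patch("lb-update", set_key("lb_version", 2), set_key("lb_version", 1)),
    ]

    # Assumption for the demo: version 2 of the load balancer is broken.
    def healthy() -> bool:
        return state["lb_version"] == 1

    print(roll_out(patches, healthy))  # ['os-update']
    print(state)                       # {'lb_version': 1, 'os_version': 2}
```

Because each change is small, the bad load balancer update is reverted on its own while the OS update stays in place. With a big-bang patch, the same failure would force you to back out everything at once.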

Your normal backup and recovery plan must work flawlessly. If you are concerned that you are not able to recover quickly from any update, please post a comment so we can figure out what your concern is and share it with everyone. 
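
One simple way to gain confidence that a recovery actually worked is to compare checksums of the original data against the restored copy. The sketch below assumes both are reachable as directory trees. The function names tree_digests and restore_mismatches are made up for this example, and the check covers file contents only, not permissions, boot records, or application consistency.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch,
            # but large files would call for chunked hashing.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[path.relative_to(root).as_posix()] = digest
    return digests


def restore_mismatches(source: Path, restored: Path) -> List[str]:
    """Return relative paths that are missing, extra, or different after a restore."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

An empty list from restore_mismatches means every file came back byte-for-byte identical. Anything else is a list of paths to investigate before you rely on that backup during a real outage.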

Bottom line: Have a data protection and disaster recovery strategy in place for all your physical and virtual needs, and have a full bare-metal restore capability, even to dissimilar devices. I maintain that image/snapshot with fine-grained, application-aware recovery is the best way to go. But if you disagree, or if you like the big-bang approach, please let me know why. Data growth and server growth continue at an explosive rate, and therefore data and server backup and recovery become more critical every day.

 

What’s your stance: image/snapshot with application-aware recovery or the “big-bang” approach? Share now!