What Should You Do When Reliable Infrastructure Fails

Yes, that is correct: even reliable infrastructure fails. This has never been more obvious than in the past two weeks. The recent celebrity deaths caused massive spikes in traffic to several sites, and, of course, some of those sites could not handle the load. However, that is only a small part of what I am talking about.

In the past week, we saw the ever-reliable Rackspace go down, taking a whole bunch of sites with it. Just a few days later, Authorize.net went down due to a fire. TechCrunch describes the seriousness of this outage:

Talk about a serious outage. Payment gateway service provider Authorize.net has been down and out for several hours, a number of tipsters inform us. That has big implications: since the service is used by tens of thousands of e-commerce vendors to accept credit card and electronic checks payments on their websites…

Now, imagine you run an ecommerce website like Toys R Us. Your host, or even your CDN, goes down. If this happens during your peak season (November and December), you could lose millions of dollars … in an hour. If your payment processing goes down, the situation becomes more interesting. You cannot complete transactions, but your site still looks live. This is potentially more frustrating for customers, who might have been willing to wait for an obvious outage to clear. The “odd errors” customers are likely to see when part of the infrastructure goes down could cause them to go to another site for their purchases. So, what should you do?

First, if you are a major ecommerce site, you should ensure that your main hosting services are properly redundant, so that if your hosting provider loses a data center, your site is not impacted. This is not necessary, or affordable, for smaller sites, but for ecommerce you need to ensure reliability. The main idea with an ecommerce site is that you need to “keep the lights on”. What if your favorite provider does not have redundancy across data centers? In this case, you should look into a smaller and cheaper hosting provider as a backup. If you always have your site deployed to a backup server, you can quickly redirect traffic to the backup site. Granted, there are a lot of pieces in this idea, like the DNS and database servers, but if you could lose significant revenue from a one-hour outage, you have to take precautions. In some cases, you could even use these backup servers as external beta servers, so that you do not feel like you are throwing away money.
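To make the backup idea a little more concrete, here is a minimal Python sketch of the decision at the heart of a failover: walk a list of deployments in priority order and send traffic to the first one that passes a health check. The hostnames and the `is_healthy` callable are hypothetical placeholders; a real setup would also have to deal with DNS changes and database replication, which this sketch ignores.

```python
from typing import Callable, Optional, Sequence


def pick_active_site(
    sites: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first site, in priority order, that passes its health check.

    `sites` lists the primary deployment first, then any backups.
    `is_healthy` is supplied by the caller (for example, an HTTP probe).
    Returns None when every site fails its check.
    """
    for site in sites:
        try:
            if is_healthy(site):
                return site
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical hostnames; the "probe" simply pretends the primary is down.
    down = {"www.primary-host.example"}
    sites = ["www.primary-host.example", "www.backup-host.example"]
    print(pick_active_site(sites, lambda site: site not in down))
```

Keeping the probe as a parameter means the same function can be exercised with a fake check in tests and with a real network probe in production.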

In the case of Authorize.net’s outage, you can take an example from Twitter. Whenever Twitter has had high loads or general database problems, they turn off a feature like search. As annoying as that feels from the user’s perspective, they manage to keep the lights on, albeit with a limited feature set. Going back to the ecommerce example, if your check processing provider goes down, it would be nice to be able to turn that payment method off but still accept other forms of payment. Even more impressive would be the ability to quickly switch from one provider to another. Wouldn’t you rather accept a limited number of payment methods than accept no payments at all? Maybe you can still generate 50% of your normal revenue during that time.
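Here is a rough Python sketch of both ideas: switching a payment method off, and falling back from one provider to another. Everything in it is an assumption made for illustration. The `Checkout` class, the `ProviderDown` exception, and the provider callables are hypothetical and do not correspond to any real gateway's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


class ProviderDown(Exception):
    """Raised by a provider callable when its gateway cannot be reached."""


# A provider takes an amount and returns a confirmation id (hypothetical shape).
Charge = Callable[[float], str]


@dataclass
class Checkout:
    # Payment method name -> providers for that method, in preference order.
    providers: Dict[str, List[Charge]]
    # Methods currently switched off, by an operator or automatically.
    disabled: Set[str] = field(default_factory=set)

    def available_methods(self) -> List[str]:
        """Methods the checkout page should still offer to customers."""
        return [m for m in self.providers if m not in self.disabled]

    def pay(self, method: str, amount: float) -> str:
        if method in self.disabled or method not in self.providers:
            raise ValueError(f"{method} is not currently accepted")
        for charge in self.providers[method]:
            try:
                return charge(amount)
            except ProviderDown:
                continue  # try the next provider for this method
        # Every provider failed: switch the method off, keep the rest running.
        self.disabled.add(method)
        raise ProviderDown(f"no provider available for {method}")
```

With this shape, an outage at one check processor removes only that option from `available_methods()`, while card payments keep flowing through whichever provider still answers.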

I am assuming most of my readers do not run ecommerce sites, but there is a lot we can learn from these issues. Even a social media application like Twitter wants to maintain as much uptime as possible. So, they turn off search capabilities for a little while. Almost any application can benefit from the ability to turn off a specific feature at any given time. What are you doing for your site to keep the lights on?

4 thoughts on “What Should You Do When Reliable Infrastructure Fails”

  1. I was just being a smart ass on the FriendFeed post. But honestly, isn’t this what Business Interruption Insurance is for? Obviously, taking whatever preventive measures you can is prudent, but there’s only so much you can do. Have a Happy 4th of July!


  2. Michael,

    For something like an ecommerce site, “Run” is probably the appropriate answer 🙂

    Insurance only helps with the loss, and I agree there is only so much you can do. A host outage is one of those issues, but what about limiting a feature like Twitter does? As much as we complain about Twitter, they do have a good idea there.


  3. I was just thinking about this: what do you do if the power grid goes down in California, for example? That is a likely scenario. Your local customers will probably not be able to get to you, but the rest of the world still can. Wouldn’t you also want to spread your risk across geographic areas? How many large hosting companies offer this kind of service?

