Last Thursday Amazon EC2 suffered a major outage affecting many sites and services hosted on EC2. Friday was a holiday in North America (Good Friday), but the holiday weekend ends on Monday, when many I.T. managers are going to have some explaining to do.
Amazon’s EC2 is one of the largest and highest-profile cloud service providers. An outage like this is sure to have many executives, and not just those at EC2 customers, asking some tough questions that the managers and engineers responsible for implementing and promoting cloud services will need to answer.
Part of the allure of the cloud has been the idea that you upload your application and get scalability, high availability and all the benefits of an expensive architecture for dollars a month. This isn’t always true. If you had deployed critical services to EC2 that were unavailable due to the outage, here are some of the questions you will probably be answering in the coming days.
I am trying to avoid referencing Amazon specifically because this could (in theory) happen with any cloud provider, and customers of other cloud providers might want to start collecting answers to these questions before they have an outage. I am also writing this without having read an in-depth description of exactly what went wrong at EC2.
- Exactly what SLA did my cloud provider promise? Did this recent outage actually count as a violation of my SLA? If so, what is my cloud provider going to do about it? Are they compensating me for the business losses I incurred due to the outage? Are they sending me a cheque large enough for me to build my own off-cloud disaster recovery centre? Are they giving me a small refund to cover this month’s service fees? Or are they just sending me a ‘We are sorry, but we still love you’ email and not charging for the time when they were down and weren’t providing service anyway?
- To what extent did my cloud provider give me adequate updates and answer questions during the outage? Were the updates specific to my cloud instances, or were they generic and impersonal? Did these updates live up to what I was promised in terms of communications during an outage? Was I promised anything?
- When my application is deployed to the cloud, what types of failures will I have resilience against? The cloud does not mean that high availability engineering happens automatically. Someone still has to figure these things out. It can be your engineering staff, the cloud provider’s engineers or a mixture of the two, but someone has to come up with the plans and procedures for dealing with different types of failures and communicate them to everyone involved. This is not the job of your cloud provider’s marketing department. If you’ve been letting them do this, then you are probably not doing your job.
- Does your cloud provider actually have the capacity to deal with an outage at one of their data centers? If they don’t have enough spare servers sitting idle at their other data centers to handle an entire data center going down, then they have misrepresented their ability to handle this situation. What tools does your cloud provider give you to monitor their spare capacity? How do you know that your cloud provider isn’t over-allocating themselves with respect to disaster recovery, gambling on no disasters actually happening?
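Two of these questions come down to simple arithmetic, and it can help to have the numbers in hand before the meeting. The following is only a rough Python sketch under my own assumptions: the function names, the 99.95% figure and the per-data-center server counts are all made up for illustration and do not describe any real provider. The first function turns an SLA percentage into the downtime it actually permits; the second checks whether the spare servers at the surviving data centers could absorb the load of any single data center that fails.

```python
def allowed_downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period before the SLA is breached."""
    return period_hours * (1.0 - sla_percent / 100.0)


def survives_loss_of_any_site(capacity: dict[str, int], load: dict[str, int]) -> bool:
    """True if, whichever single site fails, the spare capacity at the
    remaining sites is enough to take over the failed site's load."""
    for failed in capacity:
        spare_elsewhere = sum(
            capacity[site] - load.get(site, 0)
            for site in capacity
            if site != failed
        )
        if spare_elsewhere < load.get(failed, 0):
            return False
    return True


if __name__ == "__main__":
    # Hypothetical SLA: 99.95% measured over a 365-day year.
    hours = allowed_downtime_hours(99.95, 365 * 24)
    print(f"Permitted downtime per year: {hours:.2f} hours")  # 4.38 hours

    # Hypothetical server counts for three data centers.
    capacity = {"east": 100, "west": 100, "central": 100}
    busy = {"east": 80, "west": 80, "central": 80}
    print(survives_loss_of_any_site(capacity, busy))  # False: only 40 spare for 80 in use
```

If the permitted downtime is measured in hours per year and your outage lasted days, you have your answer to the first question. The capacity check is the one you usually cannot run yourself, because it needs numbers only the provider has, which is exactly why it is worth asking them for.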
A lot of I.T. managers were big on outsourcing to the cloud because they felt outages would not be their problem. I have news for you: they still are your problem. The only difference is that now you have a lot less control over your cloud provider than you did when you were running things in-house.
I’m not saying that you can’t build highly available systems on top of the cloud, and I’m not saying that deploying applications into the cloud isn’t sometimes more cost-effective than renting your own dedicated servers in data centers. What I am saying is that if availability and redundancy are important to you (and for some applications they aren’t), then you still need to make the upfront engineering investment and design these things into your architecture and procedures. Doing this requires details about your cloud provider’s architecture and their internal disaster recovery procedures and options.
When I did my instrument flight training, a significant amount of energy was devoted to dealing with emergencies and failures when flying through the clouds. Pilots and air traffic controllers both know what the other is expected to do in an emergency. These actions and procedures are studied and practiced.