And then it went dark

If you’re an internet junkie like me, you might have noticed that since last Friday (June 29) several big sites were down, including Netflix, Instagram, Pinterest and many more. They all run on Amazon’s Elastic Compute Cloud (EC2) service, which was struck by severe weather on Friday night, causing the Northern Virginia (US-EAST-1 Region) hub to go down. Almost 12 hours later, Amazon’s health status page still showed problems within each node of Northern Virginia. Apparently the failure of this single hub was enough to take many big sites offline. That is a really bad thing for a running internet business, and it can do major damage to a company’s image and reputation. The question one needs to ask here: can the promise of cloud solutions be delivered in reality? The “cloud” is supposed to mitigate issues like outages, because if one node goes down another node serves as failover. But in this case, the story doesn’t have a nice ending. According to this report on Forbes, the outage for Netflix lasted a little under 2 hours. But Pinterest, Instagram and others were still down.

We all want to be in the cloud, and we invest a lot of effort in making our applications scalable, distributed and lightning fast. But one thing forgotten in all of this is how to handle failover. The cloud offers us such richness, so why are we all betting on a single horse? Have we forgotten the old-school way of setting up a failover system with another service provider, “just in case”? Aren’t we doing the same thing for our networks?

With Windows Azure you can relax a little more as they provide a 3-way replication of your services world wide. So whenever a node goes down, Windows Azure has your instances replicated on 2 other nodes as well and will redirect traffic until the crashed node revives again. But to be on the safe side, keep a cheap virtual hosting service somewhere else, where you can at least post status updates about your services.
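To make the "just in case" idea concrete, here is a minimal sketch in Python of picking the first endpoint that still answers, with a status page at a second provider as the last resort. It is only an illustration under assumptions: the function names `http_health_check` and `pick_endpoint` are made up for this post, and the URLs in the usage note are placeholders, not real services.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request


def http_health_check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error or bad URL:
        # all of these count as "not healthy" for failover purposes.
        return False


def pick_endpoint(
    endpoints: Iterable[str],
    check: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None.

    The endpoints are tried in order, so list the primary first and the
    cheap fallback (for example a static status page elsewhere) last.
    """
    for url in endpoints:
        if check(url):
            return url
    return None
```

You could call it as `pick_endpoint(["https://app.example.com", "https://app-backup.example.net", "https://status.example.org"])`, where the three hypothetical hosts live with different providers. A real setup would more likely do this at the DNS or load balancer level, but the principle is the same: the fallback must not share a failure domain with the primary.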

For everyone who was affected this weekend: make sure you take away a valuable lesson and spread your eggs across different baskets.

2 Responses

  • – With Windows Azure you can relax a little more as they provide a 3-way replication of your services world wide. So whenever a node goes down, Windows Azure has your instances replicated on 2 other nodes as well and will redirect traffic until the crashed node revives again.

    This is incorrect. The Windows Azure Storage Service stores blobs, tables and queues in 3 local copies and 3 geo-redundant copies (except for queues, which are not geo-replicated). The geo-redundant copies will only be accessible following a major datacenter disaster.

    For Cloud Services and Virtual Machines you are on your own, although Traffic Manager will distribute traffic to services in different datacenters if you have already configured them. Data replication for these services is also basically up to you.
