Categories: Business

Amazon admits lightning didn’t strike its Dublin Data center, but a series of errors did

Credit:Amazon

In a 4,041-word statement, Amazon admitted today that lightning was not responsible for the massive service outage that brought down its AWS cloud servers, and dozens of sites, last week.

The company admitted that a series of hardware, software, technical and human errors brought down the AWS servers and kept them down for longer than was necessary.

Last Monday (August 8th) Amazon claimed that a lightning strike in Dublin, where some of its servers are located, brought down the power supply to the data center. But today the company said that the prolonged service outage was caused not by a lightning strike but by a number of issues resulting from the power failure, the cause of which has yet to be determined.

“[On] August 7th when our utility provider suffered a failure of an 110kV 10 megawatt transformer. This failure resulted in a total loss of electricity supply to all of their customers connected to this transformer, including a significant portion of the affected AWS Availability Zone. The initial fault diagnosis from our utility provider indicated that a lightning strike caused the transformer to fail. The utility provider now believes it was not a lightning strike, and is continuing to investigate root cause.”

It had been a stormy week in Dublin: part of Google’s €99.9m Montevetro building in the city was damaged overnight on the Wednesday. But the lightning story was quickly called into question. Ireland’s Silicon Republic website reported that the ESB, which operates the power supply, said a transformer failure caused the outage. The ESB said, “ESB Networks can confirm that at 18:16 on Sunday August 7th, a number of customers in Citywest lost electricity supply. It’s an unfortunate reality for every power system that faults can occur due to vandalism, storms, fires, floods, plant failure and third-party interference. In this case, the problem was the failure of a 110kV transformer in the Citywest 110kV substation.”

The ESB had initially speculated that a lightning strike had caused the outage but had not released an official statement to that effect.

According to Amazon’s statement, the power was lost at 10:41pm Irish time (about 5:40pm US Eastern time), more than four hours after the 18:16 fault reported by the ESB. Backup generators failed to kick in, meaning that the data center was running on batteries. To make matters worse, the servers continued to operate on full power, quickly depleting the batteries and causing a backlog of data requests as servers went down.

It took Amazon three hours, until 1:50am Irish time, to return power to the servers, but once the power was returned the company discovered that part of its database had become corrupted. Normally the servers would share high volumes of traffic between themselves, but with so many of the servers down this could not be done.
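The gaps between the times quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the clock times are the ones reported in this article, and the fixed UTC offsets (UTC+1 for Irish Summer Time, UTC-4 for US Eastern Daylight Time in August 2011) are assumptions supplied here rather than figures from Amazon or the ESB.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for August 2011 (both regions on daylight time).
IRISH_SUMMER_TIME = timezone(timedelta(hours=1), "IST")
US_EASTERN_DAYLIGHT = timezone(timedelta(hours=-4), "EDT")

# Times as reported in the article, all in Irish local time.
esb_fault = datetime(2011, 8, 7, 18, 16, tzinfo=IRISH_SUMMER_TIME)
amazon_power_loss = datetime(2011, 8, 7, 22, 41, tzinfo=IRISH_SUMMER_TIME)
power_restored = datetime(2011, 8, 8, 1, 50, tzinfo=IRISH_SUMMER_TIME)

# Gap between the ESB's fault time and the time attributed to Amazon.
gap_between_accounts = amazon_power_loss - esb_fault

# How long the servers were without mains power, on these figures.
time_to_restore = power_restored - amazon_power_loss

# The same power-loss moment expressed in US Eastern time.
amazon_power_loss_eastern = amazon_power_loss.astimezone(US_EASTERN_DAYLIGHT)

print("Gap between accounts:", gap_between_accounts)
print("Time to restore power:", time_to_restore)
print("Power loss, US Eastern:", amazon_power_loss_eastern.strftime("%H:%M"))
```

On these figures the two accounts differ by four hours and 25 minutes, and power was out for a little over three hours.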

To alleviate the problem, Amazon was required to physically “truck in” spare capacity to manage the load: “We brought in additional labor to get more onsite capacity online and trucked in servers from another Availability Zone in the Region. There were delays as this was nighttime in Dublin and the logistics of trucking required mobilizing transportation some distance from the data center. Once the additional capacity was available, we were able to recover the remaining volumes waiting for space to complete a successful re-mirror.”

At the same time, the servers were reporting that they contained some data that could be safely deleted, but due to a software error some of this data was, in fact, still required. During normal operation Amazon’s servers create snapshots of data, and blocks that no snapshot references any longer are cleaned up in a weekly deletion run. But a hardware failure in the previous week was not handled correctly by the cleanup software, leaving its list of snapshot references incomplete, so by the Monday the servers were reporting that blocks still in use could be removed:

“In one of the days leading up to the Friday, August 5th deletion run, there was a hardware failure that the snapshot cleanup identification software did not correctly detect and handle. The result was that the list of snapshot references used as input to the cleanup process was incomplete. Because the list of snapshot references was incomplete, the snapshot cleanup identification process incorrectly believed a number of blocks were no longer referenced and had flagged those blocks for deletion even though they were still referenced by customer snapshots. A subsequent run of the snapshot cleanup identification process detected the error and flagged blocks for further analysis that had been incorrectly scheduled for deletion. On August 5th, the engineer running the snapshot deletion process checked the blocks flagged for analysis before running the actual deletion process in the EU West Region. The human checks in this process failed to detect the error and the deletion process was executed. On Friday evening, an error accessing one of the affected snapshots triggered us to investigate.”
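The failure mode Amazon describes, an incomplete list of references making live blocks look like garbage, can be illustrated with a small sketch. This is not Amazon’s code: the function names, the block and snapshot identifiers, and the completeness guard are all hypothetical, chosen only to show why a missing snapshot in the input leads to blocks being wrongly flagged.

```python
def blocks_to_delete(all_blocks, snapshot_refs):
    """Return the blocks that no known snapshot references."""
    referenced = set()
    for blocks in snapshot_refs.values():
        referenced.update(blocks)
    return set(all_blocks) - referenced


def guarded_blocks_to_delete(all_blocks, snapshot_refs, expected_snapshots):
    """Hypothetical guard: refuse to run if the reference list is incomplete."""
    missing = set(expected_snapshots) - set(snapshot_refs)
    if missing:
        raise RuntimeError(f"reference list is missing snapshots: {sorted(missing)}")
    return blocks_to_delete(all_blocks, snapshot_refs)


# Made-up example data: four stored blocks and two customer snapshots.
all_blocks = {"b1", "b2", "b3", "b4"}
complete_refs = {"snap-a": {"b1", "b2"}, "snap-b": {"b3"}}

# Simulate the hardware failure: snap-b never makes it into the list.
incomplete_refs = {"snap-a": {"b1", "b2"}}

correct = blocks_to_delete(all_blocks, complete_refs)
wrong = blocks_to_delete(all_blocks, incomplete_refs)

print("Safe to delete:", sorted(correct))
print("Flagged with incomplete list:", sorted(wrong))
```

With the complete list only the unreferenced block is flagged. With the incomplete list, the block belonging to the missing snapshot is flagged as well, which matches the kind of error described in the statement. A guard like `guarded_blocks_to_delete` is one possible defence, assuming an independent record of which snapshots should exist.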

Amazon said that human error was responsible for the deletion of this data without it first being properly checked: “The human checks in this process failed to detect the error and the deletion process was executed. On Friday evening, an error accessing one of the affected snapshots triggered us to investigate.”

In the statement Amazon outlined several procedures it will be putting in place to prevent such failures in the future and apologized for the outage.

This is the second time this year that Amazon’s AWS servers have suffered a major outage. Dublin hosts Europe’s second largest data center, which is the fifth largest in the world.

Ajit Jain

Ajit Jain is marketing and sales head at Octal Info Solution, an iPhone app development company that also offers a platform for hiring Android app developers. He is available to connect on Google Plus, Twitter, Facebook, and LinkedIn.

MargaretO'ConnorFlanigan says:

August 17, 2011 at 5:31 AM

Lightening means making something weigh less. You mean lightning not lightEning.
- pdscott says:
  
  August 17, 2011 at 7:57 AM
  
  @MargaretO'ConnorFlanigan ouch, good catch. Thanks Margaret