Written by Benny Buljko, Network Administrator
On Tuesday, February 28th, Amazon Web Services (AWS) went offline for over 4 hours, causing millions of people to panic. Because AWS underpins platforms worldwide, the impact was widespread. Companies such as Netflix, Spotify, Pinterest, and Buzzfeed rely on Amazon’s servers to be available around the clock. In addition, about 54 of the largest online retailers rely on Amazon’s services, and they saw a significant dip in profits during the 4 hours that the services were unavailable.
Although Amazon is proud of its long run without severe service degradation, the company took a massive hit in the media after this debacle. Amazon’s representatives have since admitted that human error was the root cause of the outage.
"At 9:37AM PST [Feb. 28], an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," Amazon said. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."
The first of the two affected subsystems was an index subsystem that manages the metadata and location information for all S3 objects in the region. The second was a placement subsystem that manages the allocation of new storage.
Amazon has now begun implementing a new policy that adds safeguards to prevent events like this from occurring again. The company has also announced that it will work to minimize recovery time if an unexpected outage does occur again.
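To make the idea of a safeguard concrete, here is a minimal Python sketch of the kind of check that could stop a mistyped input from removing too many servers. It is only an illustration and not Amazon's actual tooling: the statement quoted above does not describe how the new safeguards work, so the `Subsystem` class, the `plan_removal` function, the per-run cap, and the minimum-capacity floor are all hypothetical names and assumptions.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    # Hypothetical model of a subsystem; not taken from Amazon's tooling.
    name: str
    active_servers: int
    min_required: int  # assumed floor needed to keep serving requests


def plan_removal(subsystem: Subsystem, requested: int, max_per_run: int = 5) -> int:
    """Validate a removal request and return the number of servers to remove.

    Two assumed safeguards are applied:
      1. a cap on how many servers a single command may remove, and
      2. a floor on how many servers must remain active afterwards.
    """
    if requested <= 0:
        raise CapacityGuardError("requested removal count must be positive")

    # Safeguard 1: reject unusually large requests, which are likely typos.
    if requested > max_per_run:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers from {subsystem.name}: "
            f"limit is {max_per_run} per run"
        )

    # Safeguard 2: never drop the subsystem below its minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )

    return requested
```

With a check like this in front of the removal command, a request for 3 servers would pass, while a mistyped request for 30 would be rejected before any server was touched. The specific limits are placeholders; in practice they would depend on how much spare capacity each subsystem has.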