OOPS: How a typo took down S3, the backbone of the internet.
On Tuesday morning, members of the S3 team were debugging the billing system. As part of that, the team needed to take a small number of servers offline. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” Amazon said. “The servers that were inadvertently removed supported two other S3 subsystems.”
The subsystems were important. One of them “manages the metadata and location information of all S3 objects in the region,” Amazon said. Without it, services that depend on it couldn’t perform basic data retrieval and storage tasks.
After the servers were accidentally taken offline, the various systems had to do “a full restart,” which apparently takes longer than it does on your laptop. While S3 was down, a variety of other Amazon web services stopped functioning, including Amazon’s Elastic Compute Cloud (EC2), which is also popular with internet companies that need to rapidly expand their computing capacity.
Amazon said S3 was designed to be able to handle losing a few servers. What it had more trouble handling was the massive restart.
Paula Bolyard emailed to tell me that PJM (Instapundit included) experienced a brief slowdown, but that “our tech team was able to create a quick workaround and regular service was restored almost immediately.”
We apologize, of course, if there was any inconvenience.