Amazon Web Services outage caused by a single wrong command

by Zak Killian — 4:44 PM on March 2, 2017

Over the last couple of days, you may have noticed some of your favorite websites having problems or being down altogether. The source of the issue was a failure of two key subsystems in the Amazon Simple Storage Service, better known as S3. A message straight from AWS says the issue was caused when an “authorized employee working from an established playbook” entered a command incorrectly and removed a much larger set of servers than intended from a pool supporting the S3 index and placement subsystems.

The outage affected huge swathes of the web. We at TR were mostly unaffected, but Business Insider, Quora, Imgur, Giphy, and the file upload features of many services (including Slack and Discord) were disrupted. Folks with internet-of-things hardware like thermostats and lightbulbs were also unable to control their devices. Amazon’s Alexa service was disrupted too, leaving Echo devices as little more than fancy paperweights for the duration of the outage.

Although the outage was localized to Amazon’s Northern Virginia region, several Amazon systems depended heavily on that region and ultimately failed. As a result, services that relied on those systems (even the AWS service health dashboard itself) were affected worldwide. Amazon says it’s already introduced measures to distribute and de-localize those services to keep this sort of thing from happening again. The company also says it has changed the way it controls capacity allocation so that a mistyped command can’t take down so much capacity so quickly and so easily.
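
For the curious, here is a rough idea of what a safeguard on capacity removal could look like. This is purely an illustrative sketch and not Amazon's actual tooling: the `ServerPool` class, the `remove_capacity` function, the 5% per-request cap, and the minimum-capacity floor are all hypothetical names and values made up for this example. The point is simply that a removal request gets rejected if it asks for too many servers at once or would leave a subsystem below the capacity it needs.

```python
from dataclasses import dataclass
from typing import List


class CapacityRemovalError(Exception):
    """Raised when a removal request violates a safety limit."""


@dataclass
class ServerPool:
    # All fields are hypothetical; real tooling would track far more state.
    name: str
    servers: List[str]
    min_servers: int                    # assumed floor the subsystem needs to stay healthy
    max_removal_fraction: float = 0.05  # assumed cap on how much one request may remove


def remove_capacity(pool: ServerPool, count: int) -> List[str]:
    """Remove `count` servers from the pool, refusing unsafe requests."""
    if count <= 0:
        raise CapacityRemovalError("count must be a positive number")

    # Guard 1: limit how much capacity a single command can take out.
    per_request_limit = max(1, int(len(pool.servers) * pool.max_removal_fraction))
    if count > per_request_limit:
        raise CapacityRemovalError(
            f"refusing to remove {count} servers from {pool.name}; "
            f"limit per request is {per_request_limit}"
        )

    # Guard 2: never drop the pool below its minimum required capacity.
    if len(pool.servers) - count < pool.min_servers:
        raise CapacityRemovalError(
            f"removing {count} servers would leave {pool.name} "
            f"below its minimum of {pool.min_servers}"
        )

    # Take the servers off the end of the list and hand them back to the caller.
    removed = pool.servers[-count:]
    del pool.servers[-count:]
    return removed
```

With a 100-server pool and these made-up limits, a request to remove three servers goes through, while a fat-fingered request for 50 raises `CapacityRemovalError` before anything is touched. Real systems would likely add rate limiting over time and a confirmation step, but the basic idea is the same: the tool, not the operator, enforces the floor.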