Saturday, July 26, 2008

Amazon S3, July 2008, and the Bell System, January 1990

The only description of the Ma Bell cascading switch failure of 1990 that I was able to google up is from "The Hacker Crackdown" by Bruce Sterling (you have to scroll about 90% down, or Ctrl-F for "System 7"):
On January 15, 1990, AT&T's long-distance telephone switching system crashed.
[...]
In order to maintain the network, switches must monitor the condition of other switches -- whether they are up and running, whether they have temporarily shut down, whether they are overloaded and in need of assistance, and so forth. The new software helped control this bookkeeping function by monitoring the status calls from other switches.

[...] the switch had been programmed to monitor itself constantly for any possible damage to its data. When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software. It would signal its fellow switches not to send any more work. It would go into the fault recovery mode for four to six seconds. And then the switch would be fine again, and would send out its "OK, ready for work" signal.
However, the "OK, ready for work" signal was the very thing that had caused the switch to go down in the first place.
[...]
At approximately 2:25 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem. It went into fault recovery routines, announced "I'm going down," then announced, "I'm back, I'm OK." And this cheery message then blasted throughout the network to many of its fellow 4ESS switches. Many of the switches, at first, completely escaped trouble. These lucky switches were not hit by the coincidence of two phone calls within a hundredth of a second. Their software did not fail -- at first. But three switches -- in Atlanta, St. Louis, and Detroit -- were unlucky, and were caught with their hands full. And they went down. And they came back up, almost immediately. And they too began to broadcast the lethal message that they, too, were "OK" again, activating the lurking software bug in yet other switches.
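Sterling's prose maps almost directly onto an event-driven simulation. Here is a toy sketch of that feedback loop; the probabilities, timings, and topology are all invented for illustration and have nothing to do with the actual 4ESS code:

# Toy model of the 1990 cascade. Every name, number, and the topology below
# is made up; the real 4ESS software was nothing like this.
import heapq, random

P_UNLUCKY = 0.3        # chance a switch is hit by "two calls within 1/100 s"
                       # while it is digesting a peer's status broadcast
RECOVERY_TIME = 5.0    # "four to six seconds" of fault recovery, roughly

def simulate(n_switches=12, horizon=30.0, seed=1):
    random.seed(seed)
    down = [False] * n_switches
    # one switch has a real, minor fault and recovers at t=0, broadcasting "OK"
    events = [(0.0, 0, "recover")]
    while events:
        now, idx, kind = heapq.heappop(events)
        if now > horizon:
            break
        if kind == "recover":
            down[idx] = False
            print(f"{now:5.1f}s  switch {idx} announces 'OK, ready for work'")
            for peer in range(n_switches):        # broadcast status to all peers
                if peer != idx:
                    heapq.heappush(events, (now + 0.1, peer, "status"))
        elif kind == "status" and not down[idx]:
            if random.random() < P_UNLUCKY:       # the latent bug fires
                down[idx] = True
                print(f"{now:5.1f}s  switch {idx} garbles its data and goes down")
                heapq.heappush(events, (now + RECOVERY_TIME, idx, "recover"))

simulate()

The essential property is that the recovery path itself generates the traffic that re-triggers the bug, so instead of settling down the network keeps knocking itself over. That is exactly the "I'm back, I'm OK" loop Sterling describes.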

Here, for comparison, are some excerpts from "Amazon S3 Availability Event: July 20, 2008" (which should rather have been called "Massive Cascading Failure"):
[...] At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully.

[...] Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer's request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests.
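The interesting detail is the ordering: state is exchanged before any customer work gets done. A rough sketch of what such a handler might look like (the field names, merge rule, and functions below are my own invention, not Amazon's actual protocol):

# Bare-bones sketch of "gossip first, then the request" between storage
# servers. The message layout and merge rule are guesses for illustration.
from dataclasses import dataclass

@dataclass
class ServerState:
    version: int      # monotonically increasing per-server counter
    status: str       # e.g. "up", "down", "overloaded"

def merge_gossip(mine: dict, theirs: dict) -> dict:
    """Keep, for every server id, whichever state claims the higher version."""
    merged = dict(mine)
    for server_id, state in theirs.items():
        if server_id not in merged or state.version > merged[server_id].version:
            merged[server_id] = state
    return merged

def handle_connection(local_view: dict, peer_view: dict, request):
    # Gossip is reconciled before any customer work is done, so if the views
    # diverge badly, every connection pays the reconciliation cost first.
    local_view = merge_gossip(local_view, peer_view)
    return process(request, local_view)

def process(request, view):
    # placeholder for the real request handling
    return f"handled {request!r} with a view of {len(view)} server(s)"

# example: server A learns from B that server-7 went down
a_view = {"server-7": ServerState(3, "up")}
b_view = {"server-7": ServerState(5, "down")}
print(handle_connection(a_view, b_view, "GET /bucket/key"))

Once the state being gossiped is wrong or bloated, the reconciliation step sitting in front of every request becomes the workload, which is exactly the "spending almost all of their time gossiping" symptom.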

[...] we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. [...] As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
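In other words, the gossip payload had no end-to-end integrity check, so a message with a single flipped bit that still parsed was accepted as truth and replicated. The usual guard is to checksum the whole message and drop anything that fails the check; a minimal sketch (my illustration, not necessarily the exact fix Amazon deployed):

# Wrap every gossip message in a checksum and refuse to merge anything
# that fails verification.
import json, zlib

def pack(state: dict) -> bytes:
    payload = json.dumps(state, sort_keys=True).encode()
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload

def unpack(message: bytes) -> dict:
    checksum, payload = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("corrupted gossip message, dropping it")
    return json.loads(payload)

# A flipped bit in transit now surfaces immediately instead of spreading:
msg = bytearray(pack({"server-42": "up"}))
msg[-3] ^= 0x04                      # simulate the single-bit corruption
try:
    unpack(bytes(msg))
except ValueError as e:
    print(e)                         # corrupted gossip message, dropping it

A CRC32 is enough to catch any single-bit flip, so the corruption dies at the first receiving server instead of propagating through the gossip.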