Computerworld’s Patrick Thibodeau wrote an excellent article on the recent Amazon Web Services outage and quoted Uptime Institute’s Ken Brill.
Ken Brill, founder of the Uptime Institute, which researches data center issues, points to Japan’s Fukushima Nuclear Power Plant. For 40 years, there were no problems at the plant. Then an earthquake and tsunami that hit in March disabled the facility with catastrophic consequences.
Brill expects that a post-mortem on the nuclear plant will show at least 10 things that could have been done to help avoid the failure, reduce the magnitude of the damage, and make recovery easier or faster.
The Amazon post-mortem will likely show something similar, said Brill.
Despite the redundancies and backups built into the Amazon cloud, “you hit a combination of events for which the backups don’t work,” he said.
Users see the promise of cloud technology as a way to reduce costs and be greener, but “that [also] means concentrating processing in fewer, bigger places,” said Brill. Thus, when something goes wrong, “it has a bigger impact.”
Meanwhile, the promise of reliable cloud uptime is putting protection advocates — the IT people who champion more internal reliability and safeguards — at a disadvantage, he added. “There will always be an advocate for how it can be done cheaper, [but] if you haven’t had a failure for five years — who is the advocate for reliability?
“My prediction is that in the years ahead, we will see more failures than we have been seeing, because people have forgotten what we had to do to get to where we are,” Brill added.
For more info on this topic, attend Ken’s presentation at Uptime Symposium — Creating and Managing High Reliability Organization.
As technology becomes more pervasive, as globalization and standardization occur, as the speed of change increases, and as cost cutting invisibly reduces safety margins, the world is apparently experiencing an increasing frequency and an increasing impact of man-made disasters. This session will explore how equipment, people, local workplace factors, organizational culture, and latent conditions all interact to produce avoidable failures. Recent nuclear plant meltdowns in Japan will be used to illustrate how latent conditions can lie safely dormant for many years and then can unexpectedly combine with “normal” defense breaches to cause catastrophe beyond imagination. Research has now shown that many common solutions to “human error” may actually be dysfunctional, producing fewer, but much bigger disasters. Forty-five hundred Abnormal Incident Reports (AIRs) collected by the Uptime Institute over 16 years will be used to illustrate how these ideas apply to data center malfunctions.