Amazon outage indicative of more cloud failures on the horizon


Computerworld’s Patrick Thibodeau wrote an excellent article on the recent Amazon Web Services outage and quoted Uptime Institute’s Ken Brill.

Ken Brill, founder of the Uptime Institute, which researches data center issues, points to Japan’s Fukushima Nuclear Power Plant. For 40 years, there were no problems at the plant. Then an earthquake and tsunami that hit in March disabled the facility with catastrophic consequences.

Brill expects that a post-mortem on the nuclear plant will show at least 10 things that could have been done to help avoid that failure and reduce the magnitude of damage and would have made it easier or faster to recover from.

The Amazon post-mortem will likely show something similar, said Brill.

Despite the redundancies and backups built into the Amazon cloud, “you hit a combination of events for which the backups don’t work,” he said.

Users see the promise of cloud technology as a way to reduce costs and be greener, but “that [also] means concentrating processing in fewer, bigger places,” said Brill. Thus, when something goes wrong, “it has a bigger impact.”

Meanwhile, the promise of reliable cloud uptime is putting protection advocates — the IT people who champion more internal reliability and safeguards — at a disadvantage, he added. “There will always be an advocate for how it can be done cheaper, [but] if you haven’t had a failure for five years — who is the advocate for reliability?

“My prediction is that in the years ahead, we will see more failures than we have been seeing, because people have forgotten what we had to do to get to where we are,” Brill added.

For more info on this topic, attend Ken’s presentation at Uptime Symposium — Creating and Managing High Reliability Organization.

As technology becomes more pervasive, as globalization and standardization occur, as the speed of change increases, and as cost cutting invisibly reduces safety margins, the world appears to be experiencing an increasing frequency and an increasing impact of man-made disasters. This session will explore how equipment, people, local workplace factors, organizational culture, and latent conditions all interact to produce avoidable failures. Recent nuclear plant meltdowns in Japan will be used to illustrate how latent conditions can lie safely dormant for many years and then unexpectedly combine with “normal” defense breaches to cause catastrophe beyond imagination. Research has now shown that many common solutions to “human error” may actually be dysfunctional, producing fewer, but much bigger, disasters. Forty-five hundred Abnormal Incident Reports (AIRs) collected by the Uptime Institute over 16 years will be used to illustrate how these ideas apply to data center malfunctions.

Posted by mstansberry on 27-04-2011
Categories: Uncategorized
Tags: , , , ,

Data center infrastructure for cloud computing providers


In this video Q&A with Emerson Network Power/Liebert VP Peter Panfil, we discuss how public cloud computing providers are driving Liebert’s product design toward more flexible and efficient hardware. We also discuss Emerson Liebert’s private cloud in its new data center in St. Louis. Panfil will be presenting at Uptime Symposium with customer and public cloud computing provider, Raging Wire.

Session info here: Everybody now agrees that cloud services are going to play a very large part in the future of IT. It’s much less clear, however, what this actually means for the physical infrastructure of the data center that is delivering these services. For example, some evidence suggests that virtualized and cloud workloads can introduce much greater volatility in terms of energy consumption, power use, and availability. Furthermore, cloud providers can experience periods of dramatic and unexpected growth. And it may equally be that customers are less “sticky”, moving their workloads out when prices are right, simply because they can. What does this mean for data center designs? Which best practices should be observed in order to experience the benefits of cloud adoption while mitigating potential risks?

Click here to register for Symposium now.

Posted by mstansberry on 27-04-2011
Categories: Uptime Institute Symposium
Tags: , , ,

Technical Response to Facebook’s Open Compute Project data center plans


Earlier this month, social media giant Facebook unveiled details of its data center facilities designs and server hardware plans in what it is calling the Open Compute Project, posting technical specifications and CAD drawings of its custom server hardware and data center MEP components.

The process is groundbreaking, and Facebook should be lauded for its openness and generosity. In the Web-scale data center space, secrecy has been the norm. Facebook’s move will have significant ramifications for the entire data center community – especially if it inspires other highly-efficient data center operators to follow suit.

Facebook’s data center operations have faced unprecedented public scrutiny, with environmental organizations protesting the company’s choice of coal-powered electricity. Facebook’s IT operations touch so many of our lives, it is not surprising that its data centers are of major public interest.

Jonathan Heiliger, VP Technical Operations at Facebook, asked the data center community to weigh in on the designs in a recent video. “Give us feedback, tell us where we screwed up, tell us where we made a bad decision, and help us make it better.”

In the spirit of that request, Uptime Institute Professional Services engineers offer the following feedback:

Facebook’s cooling method water-wasteful in a desert community

“As an engineer from the water-starved west, this is near and dear to me,” said Keith Klesner, consultant with Uptime Institute Professional Services. “The climate in the area is a high desert with average annual precipitation of less than 10 inches. In a region where water is scarce, Facebook has designed the data center with 100% evaporative free cooling. The local municipality sources all of its water from a shallow aquifer, most likely the same one into which Facebook has sunk its wells.”
From Facebook: The direct evaporative system is supplied primarily by an on-site well and secondarily by the normal city water distribution system. Both sources feed into a storage tank. The storage tank provides 48 hours of water in the event well water and city water sources are unavailable.

“For a site considering sustainability and overall corporate social responsibility, my grade for the cooling choice is a D,” Klesner said. “The new Bend Broadband data center down the road in Bend, Oregon is a more sustainable model (using indirect air side economization) given the local climatology. This thread on Facebook’s own pages hits on my exact point. The City of Prineville is small and running out of water. Facebook is working with the City, but aquifers do not often recharge at the rate of extraction.”

“Phase 1 of the data center is 30 MW and Phase 2 is TBD. I think a starting consumption estimate could be 10,000 gallons per MW per day, putting total water consumption at 300,000 gallons per day. That’s about 10% of the total city water use, which will rise significantly with Phase 2 of the project. The designer has the exact volume calculations, but the sourcing issue is the heart of the matter. The City of Prineville will run out of water from current sources sometime between 2015 and 2017. Their solution will be to drill to a deeper aquifer, which will likely be subject to overuse in the future,” Klesner said.
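Klesner’s back-of-the-envelope estimate can be sketched in a few lines. Note that the 10,000 gallons per MW per day figure is his stated starting assumption, not a measured value, and the annual projection here is our extrapolation for illustration:

```python
# Back-of-the-envelope water-use estimate for 100% evaporative cooling.
PHASE_1_LOAD_MW = 30
GALLONS_PER_MW_PER_DAY = 10_000  # assumed starting consumption rate

daily_use_gal = PHASE_1_LOAD_MW * GALLONS_PER_MW_PER_DAY
annual_use_gal = daily_use_gal * 365  # simple extrapolation, ignores seasonality

print(f"{daily_use_gal:,} gal/day")    # 300,000 gal/day
print(f"{annual_use_gal:,} gal/year")  # 109,500,000 gal/year
```

In practice, evaporative cooling load varies strongly with season and outdoor conditions, so a flat daily rate is only a rough planning figure.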

Facebook data centers vulnerable to downtime

“Wildfires, dust and volcano ash happen,” Klesner said. “In the case of extreme outdoor contaminants the data center will shut down.”

From Facebook: We acknowledge that this is a condition that can cause potential shutdown. We already have filtration installed and will run evaporative cooling at full capacity to reduce smoke and particulates in the event of a fire or contamination. Then, depending on intensity, we can utilize time for orderly shutdown, or else run for a prolonged period of time at minimum OA. We have a provision for a closed-loop system that uses indirect cooling.

The high desert east of the Cascade Mountains burns every summer. It’s only a matter of time before Facebook has to deal with this issue.

Facebook has said that the Uptime Tier Classification System does not apply to its Prineville data center. But one would think the organization might be less cavalier about potentially disruptive vulnerabilities at a facility that supports its primary line of business.

In fact, the details of the Facebook data center design emphasize just how effective Tiers are at rating data center investment in terms of performance potential. Some of the facilities details reveal a fairly typical cost-focused rather than performance-minded data center design.

For example, Facebook’s backup generators are a potential vulnerability. “The document states the engine-generators are Standby rated,” Uptime Institute Professional Services consultant Christopher Brown pointed out. “This will impact the ability of the units to support the facility through long-term power outages, as the Standby rating carries yearly runtime limitations. The engine-generators are typically used for reliable power supply when performing UPS maintenance. Regular testing of the units and maintenance of other critical equipment may consume runtime hours, impacting the units’ ability to support a long-term power outage or long-term failure of a UPS system.”
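Brown’s point about yearly runtime limitations is easy to illustrate with a hypothetical budget. The limit and the testing/maintenance hours below are invented for illustration; actual standby-rating limits vary by manufacturer and rating standard:

```python
# Hypothetical runtime budget for a standby-rated engine-generator.
# All figures are illustrative assumptions, not Facebook's actual numbers.
ANNUAL_RUNTIME_LIMIT_HRS = 500   # assumed manufacturer standby-rating limit

monthly_test_hrs = 2 * 12        # e.g. two hours of load testing per month
ups_maintenance_hrs = 3 * 8      # e.g. three 8-hour UPS maintenance windows

# Hours left in the rating for riding out actual utility outages:
outage_budget_hrs = ANNUAL_RUNTIME_LIMIT_HRS - monthly_test_hrs - ups_maintenance_hrs
print(outage_budget_hrs)  # 452
```

The arithmetic is trivial, but it shows the mechanism: every planned run against a standby rating reduces the margin available for an unplanned, extended outage, which is why prime- or continuous-rated units are preferred where long outages must be supported.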

Lastly, much of the mechanical infrastructure does not lend itself to Concurrent Maintainability. “The large bus ducts (1,000 amps and above) are generally constructed with bolt-together sections and thus allow for maintenance of the bus sections. But the smaller bus duct that delivers power to the servers does not typically use bolt-together sections, instead using press-fit connections. These connections are not maintainable and thus create a long-term operational problem,” Brown said.

On the facilities side, inconsistent maintainability across select systems and the constrained performance potential of the engine-generators yield an overall Tier II rating. These are fundamental constraints that will impact long-term operations, and they go to the heart of the Tiers: the business case.

The key takeaways from this analysis:
-Working backwards from the facilities design, Facebook’s IT operations at its Prineville, OR data center may be core to its business, but the company is willing to tolerate downtime.
-While Facebook’s Prineville data center is energy efficient, it has a long way to go to call itself green.

“The term ‘green’ cannot just be about reducing electrical power consumption. It has to involve the natural resource limitations of the local area. Green must be centered on designing data centers that minimize the consumption of all natural resources, not just one,” Brown said. “Any green approach should be designed to minimize energy consumption while not increasing strain on other vital resources. Otherwise we trade one problem for another.”

Continue the dialogue at Uptime Symposium
Facebook’s data center operations team will give the keynote address Wednesday May 11th at Uptime Institute Symposium, with a presentation: Facebook’s Latest Innovations in Data Center Design, featuring Facebook’s Jay Park, Director, Data Center Design Engineering and Facilities Operations, Thomas Furlong, Director of Site Operations, and Daniel Lee, Data Center Mechanical Engineer.

Posted by mstansberry on 18-04-2011
Categories: Uncategorized
Tags: , ,

Map of Uptime Institute Tier Certified Data Centers


Did you know that Uptime Institute has awarded Tier Certifications to data center owners and operators in 19 countries worldwide? Check out this map of Tier Certification Owners. The full list is here.

Posted by mstansberry on 15-04-2011
Categories: Uncategorized
Tags: ,

Green data center case study winners announced, present at Symposium


Uptime Institute has just announced the winners of this year’s Green Enterprise IT (GEIT) Awards, which recognize outstanding projects for improved energy and resource productivity in IT and data center operations. There are awards in several categories, to recognize different kinds of innovation.

This year’s winners, finalists and categories are listed on the Symposium website. The winners will present their case studies at Symposium, May 9-12, and finalists will be on panels discussing their projects.

“We received a record number of entries for the 2011 GEIT Awards from companies across the globe doing remarkable things to lead the charge in making the data center industry more efficient,” said Andy Lawrence, Program Director of the Uptime Institute Symposium and Research Director for Eco-Efficient IT, The 451 Group. “Uptime Institute’s GEIT Awards program educates the data center and IT industries on new initiatives to effectively reduce energy consumption by highlighting the cutting-edge achievements of large enterprises, as well as smaller organizations.”

Uptime Institute congratulates all GEIT Award winners and finalists.

Posted by mstansberry on 14-04-2011
Categories: Uptime Institute Symposium
Tags: , , ,