Posted by mstansberry | Posted in Data center availability, Data center media | Posted on 16-06-2011
An excellent article from SearchDataCenter.com’s data center advisory board, “The causes and costs of data center system downtime,” featured Uptime Institute VP Rick Schuknecht.
Schuknecht leads Uptime Institute’s elite data center end user network. From the article:
Schuknecht said 73% of data center downtime is caused by human error. Human error includes poor training, poor maintenance practices and poor operational governance. He said an outage can be very stressful and damaging to morale, because jobs and compensation are often based on an organization’s availability goals.
Schuknecht also said that an organization with a good investigation protocol in place can determine the root cause of an outage and identify steps to take in the short and long term. But that only works if the protocol is effective.
There are some overlooked repercussions to an outage. For example, in the financial industry an outage can bring a regulatory penalty. An outage can also erode a company’s competitive edge through a loss of reputation within the industry, the customer base or both. Where would you rather put your money: in the bank with no downtime or the one with repeated downtime? Most financial companies have processes in place to preserve or recover data; it’s the loss of transactional continuity that can cause the biggest problems.
What can data center staff do to avoid and mitigate system downtime? Schuknecht recommends establishing a good facilities and computing maintenance program for each piece of equipment, creating a staff training program that describes how and when to respond to downtime events, providing adequate funding for operating expenses to make sure everything works properly, and instituting a good governance program under which site infrastructure is operated in accordance with manufacturer expectations.
Posted by mstansberry | Posted in Data center availability, Data center colocation | Posted on 02-05-2011
Data Center Dynamics reported last week that Aruba, Italy’s largest web hosting data center, went down due to a fire involving the batteries in the UPS room.
“Battery fires happen,” said Uptime Institute Professional Services Consultant, Chris Brown. “Once the fire starts the battery can feed the fire until it exhausts its energy. More information on this particular incident would be needed to know if this was or was not an avoidable situation (i.e. if it was the result of a thermal runaway or some other issue). But thermal runaway would be the biggest concern I would have for a cause of a battery fire.”
There are ways to help avoid thermal runaway, according to Brown. They include, but are not limited to, keeping the batteries and the charging equipment (the UPS) in good repair, installing a battery monitoring system that tracks the cell temperature of each battery, and using temperature-compensated charging.
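As a rough illustration of two of the measures Brown mentions, the Python sketch below lowers the float charging voltage as cell temperature rises and flags cells running well above ambient. Every number and name in it is an assumption chosen for illustration, not a figure from Brown or from any manufacturer; real setpoints should come from the battery datasheet.

```python
# Illustrative sketch only. All constants are assumed values, not
# manufacturer data; take real setpoints from the battery datasheet.
REFERENCE_TEMP_C = 25.0             # assumed reference temperature
FLOAT_VOLTS_PER_CELL = 2.25         # assumed float voltage at reference
COMP_VOLTS_PER_CELL_PER_C = -0.003  # assumed compensation slope
MIN_VOLTS_PER_CELL = 2.20           # assumed lower clamp
MAX_VOLTS_PER_CELL = 2.30           # assumed upper clamp
MAX_RISE_OVER_AMBIENT_C = 10.0      # assumed alarm threshold


def compensated_float_voltage(cell_temp_c: float, cells: int) -> float:
    """Return the string float voltage adjusted for cell temperature.

    Warmer cells get a lower charging voltage, which reduces the
    charging current that would otherwise add more heat.
    """
    delta = cell_temp_c - REFERENCE_TEMP_C
    per_cell = FLOAT_VOLTS_PER_CELL + COMP_VOLTS_PER_CELL_PER_C * delta
    # Clamp so a faulty temperature reading cannot push the charger
    # to an extreme setpoint.
    per_cell = max(MIN_VOLTS_PER_CELL, min(MAX_VOLTS_PER_CELL, per_cell))
    return per_cell * cells


def overheating_cells(cell_temps_c: dict[str, float], ambient_c: float) -> list[str]:
    """Return the names of cells hotter than ambient by more than the threshold."""
    return sorted(
        name
        for name, temp in cell_temps_c.items()
        if temp - ambient_c > MAX_RISE_OVER_AMBIENT_C
    )


if __name__ == "__main__":
    print(compensated_float_voltage(35.0, cells=6))
    print(overheating_cells({"cell-1": 26.0, "cell-2": 40.0}, ambient_c=25.0))
```

A monitoring system would typically run checks like these continuously and raise an alarm, so that technicians can inspect a hot cell before it reaches thermal runaway.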
“Basically the best way to avoid battery issues is to stay on top of the preventative maintenance of the batteries and charging means. Regular preventative maintenance can spot problematic batteries or cells before they fail internally that can lead to a thermal runaway as well as allow technicians to adjust charging voltage and current to ensure the batteries are not overcharged,” Brown said. “Batteries are combustible; there is always a risk of fire from batteries. And that risk should drive where batteries are placed and the type of fire extinguishing means used for the room.”
Terral Altom, Uptime Institute Professional Services Consultant, said batteries can and often do build up heat and hydrogen, and under the right circumstances, a fire can erupt. “The trouble with wet cell batteries is that they have plastic jars, and these jars are highly combustible.”
Some insurance underwriters require a sprinkler system in battery rooms. In an anecdotal account told to Altom, a flooded cell battery room caught fire, and the gaseous suppression system discharged. The smoldering battery jars reignited after the suppressant gas dissipated. The data center team then had to call the fire department, and once firefighters arrived, it took 45 minutes to extinguish the fire. Because of smoke and water, the damage extended well beyond the battery room.