When lightning strikes: How to reduce data centre downtime

The recent lightning strikes that caused a Google data centre in Belgium to lose data, and later restore some of it, highlight the importance of anticipating risks to, and managing, critical data centre assets.

The repeated lightning strikes on the electricity grid that powers facilities in Saint-Ghislain affected five per cent of persistent or non-virtual disks in the zone that powers Google Compute Engine, its cloud computing platform.

According to Data Center Knowledge, the problem was compounded when the data centre’s battery backup failed, although Google said the vast majority of the data was recovered several days after the strikes occurred.

The lightning strikes raise an important issue for the data centre fraternity: data centre downtime and the costs associated with unplanned outages.

With the increase in reliance on IT systems to support business-critical applications, a single downtime event now has the potential to significantly impact the profitability of an enterprise. In fact, for enterprises with revenue models that depend on the data centre’s ability to deliver IT and networking services to customers, downtime can be particularly costly.

According to Data Center Knowledge, the average cost of data centre downtime across industries was approximately $7,900 per minute and the average reported incident lasted 86 minutes, resulting in an average cost per incident of approximately $690,200. For a company the size of Google, these average costs can be multiplied many times over.
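As a rough illustration of how these figures relate, the sketch below multiplies the quoted per-minute cost by the quoted incident length. The function name incident_cost is hypothetical and used only for this example. Note that the simple product comes to about $679,400, slightly below the reported per-incident average of $690,200; the gap may reflect rounding in the quoted figures or the way the averages were calculated, but the source does not say.

```python
def incident_cost(cost_per_minute: float, minutes: float) -> float:
    """Estimate the cost of one outage as cost per minute times duration.

    This is a simplification: it assumes cost accrues evenly over the outage.
    """
    return cost_per_minute * minutes


# Figures quoted in the article (approximate industry averages).
avg_cost_per_minute = 7_900   # US dollars per minute
avg_incident_minutes = 86     # minutes per reported incident

estimate = incident_cost(avg_cost_per_minute, avg_incident_minutes)
print(f"Estimated cost per incident: ${estimate:,.0f}")
```

The same function can be used with your own organisation's per-minute cost and typical outage length to get a first-pass estimate for a risk assessment.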

Unfortunately, data centre outages aren’t just costly; they are also quite common.

According to recent research by the Ponemon Institute, published as Calculating the Cost of Data, 95% of the data centres surveyed said they had experienced one or more unplanned outages in the past 24 months.

The Ponemon Institute’s findings also show that weather events, like the one that affected the Google facility, account for only 12% of data centre outages. UPS system failure tops the list at 29%, followed by human error at 24% and water, heat or Computer Room Air Conditioning (CRAC) failure at 15%.

While the Datapod modular data centre system is built to withstand lightning strikes and cyclones, what can be done to reduce the incidence of unplanned data centre outages and speed recovery from them?

Adam Smith, director of modular data centre maker Datapod, suggests there are nine considerations data centre managers and CIOs can build into their data centre strategy to help both reduce the incidence of unplanned data centre downtime and speed recovery from such an incident.

Nine Data Centre Considerations

Educate management and key executive decision makers: Ensure there is open and continuous dialogue with management and that they understand the importance of the data centre to the overall performance of the organisation. One of the benefits of the modular Datapod data centre system is scalability. This can reduce capital and operational costs compared with a traditional approach, which means more budget can be used to ensure uptime and reduce risk.
Evaluate risk and build accordingly: Perform a risk assessment, including a cost-benefit analysis, to determine what is reasonable to expect from providers (power and services) during power outages. Determine the cost of an outage to the business and, if required, build in redundancies to overcome the risk of an outage. For example, whether you have a traditional data centre or a modular system, adding a Datapod Utilitypod can build power redundancy into your existing data centre should there be an interruption to mains power.
Open channels with external stakeholders: Establish cooperation among the power companies, internal departments and remote service providers on which the company depends. Establish a framework and working document that identifies protocols and procedures and enables key stakeholders to carry out restoration activities in a timely manner.
Implement DCIM or BMS: Use Data Centre Infrastructure Management (DCIM) or Building Management System (BMS) software with interactive 3D visualisations. With 3D visualisation, you would know within seconds what is malfunctioning. For example, a failed component would be instantly identifiable by colour, so you would know exactly where it was located.
Use the latest data centre design: Utilise best practices in modular data centre design and redundancy to maximise availability. A number of proven best practices serve as a good foundation for data centre design and redundancy, and these are available as an off-the-shelf option, including power and cooling options, from a modular data centre manufacturer like Datapod.
Ensure appropriate resource allocation: Dedicate appropriate resources to recovery and training in anticipation of an unplanned outage. This is more than having enough people to reset systems following an outage. It involves site preparedness (food, lodging, alternate transportation and up-to-date staff training) in case the outage is the result of a natural disaster. For example, a major hurricane or cyclone could cut off critical supplies and affect access to generator fuel and critical parts.
Regular testing: Regularly test generators and switchgear to ensure emergency power is available in case of a utility outage. Right from day one, Datapod customers have peace of mind when it comes to testing. Customers come into the factory to confirm that all generators and switchgear kick in as they would in an emergency, even before their new data centre is deployed. Customers literally test drive their new data centre while it is still in the factory. Furthermore, Datapod encourages regular onsite testing. Regular testing confirms proper operation during an outage and keeps the facility team’s training up to date should an unplanned outage occur.
Regular testing of UPS batteries: Having a dedicated battery monitoring system is sensible. According to Datapod research, battery failure is the leading cause of UPS system loss of power in traditional data centres. Using a predictive battery monitoring method can provide early notification of potential battery failure.
Regular testing of processes and procedures: If human error accounts for 24% of downtime, anything you can do to familiarise your staff and contractors with the right steps to take during a critical event will be a good investment. Of course, the more standardised your environment, the less likely human error is to be a factor, whereas highly customised facilities will see a greater incidence of failures related to human error.
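To make the predictive battery monitoring idea from the list above more concrete, here is a minimal sketch. It assumes, purely for illustration, that each battery's internal resistance is measured periodically and that a sustained rise over its early baseline is treated as a warning sign. The function name, the data layout, the battery identifiers and the 25% threshold are all hypothetical; a real monitoring system would follow the battery manufacturer's guidance.

```python
from statistics import mean


def flag_weak_batteries(readings, baseline_count=3, rise_threshold=0.25):
    """Return (battery_id, fractional_rise) for batteries that look degraded.

    readings: dict mapping a battery id to a list of internal resistance
    measurements (milliohms), oldest first. Hypothetical data layout.
    baseline_count: how many early readings form the baseline.
    rise_threshold: fractional rise over baseline that triggers a flag
    (0.25 is an assumed value, not a vendor figure).
    """
    flagged = []
    for battery_id, values in readings.items():
        if len(values) <= baseline_count:
            continue  # not enough history to compare against a baseline
        baseline = mean(values[:baseline_count])
        if baseline <= 0:
            continue  # skip invalid measurements
        rise = (values[-1] - baseline) / baseline
        if rise >= rise_threshold:
            flagged.append((battery_id, round(rise, 2)))
    return flagged


# Illustrative readings only; these are not real measurements.
sample_readings = {
    "string-A/cell-01": [3.0, 3.1, 2.9, 3.0, 3.1],
    "string-A/cell-02": [3.0, 3.0, 3.0, 3.4, 3.9],
}

for battery_id, rise in flag_weak_batteries(sample_readings):
    print(f"{battery_id}: resistance up {rise:.0%} over baseline")
```

Comparing each battery against its own baseline, rather than a single fixed limit, is one way to get early notice of a unit that is drifting before it fails under load.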

The Datapod modular data centre system is built to meet the toughest standards and withstand the toughest conditions, and is certified for quality by SAI Global.

For more information about the Datapod modular data centre solution, download the Datapod White Paper.

You can also pre-register for the Datapod Data Centre Efficiency and Sustainability White Paper.

NB: Lightning strike image: “Lightnings sequence 2 animation” by Sebastien D’ARCO (original data) and Koba-chan (animation); original source is Image:Lightnings sequence 2.jpg. Licensed under CC BY-SA 2.5 via Wikimedia Commons.
