
Top 14 Considerations for Addressing Data Center Facilities Management Risks


Meeting operational challenges in the data center requires
organization, planning, and focus
By Stephen Burgess

Data center facilities managers face enormous pressure every day. The challenge of operating a complex and
ever-changing facility is considerable, given the increasing business demands and budget pressures prevalent
in the industry. Yet, successful data center facilities managers continue to meet the constant challenge. Uptime
Institute has compiled these 14 considerations for reducing risk that data center facilities managers can
embrace to identify and minimize problems affecting operations.

1. Overtime
Sustained overtime rates of 10% or more can produce chronically overworked and fatigued facility personnel,
which correlates very strongly with increased rates of incidents, including outages and even serious injury or loss
of life. Staffing a facility properly and correctly aligning workloads to the real needs of the facility is the best
way to eliminate chronic fatigue-inducing overtime, maximize personnel safety, and minimize the potential for
outages, as the vast majority of them are due to operator error.

Trying to save money by running a facility with a very lean staff is potentially the most dangerous and
risky decision any data center owner can make, because the cost of facility operations is so low relative to the
cost of the facility and that of the IT assets it supports.
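
As a rough illustration of how the 10% figure might be tracked, the sketch below flags sustained overtime from payroll-style totals. The 10% threshold comes from the text above; defining the rate as overtime hours divided by regular hours, and treating "sustained" as three consecutive periods, are assumptions made only for this example.

```python
def overtime_rate(regular_hours: float, overtime_hours: float) -> float:
    """Return overtime as a fraction of regular scheduled hours (assumed definition)."""
    if regular_hours <= 0:
        raise ValueError("regular_hours must be positive")
    return overtime_hours / regular_hours


def sustained_overtime(periods, threshold=0.10, min_consecutive=3):
    """Return True if the rate meets the threshold for min_consecutive periods in a row.

    periods: iterable of (regular_hours, overtime_hours) tuples, oldest first.
    min_consecutive is a hypothetical choice; the article does not define "sustained".
    """
    streak = 0
    for regular, overtime in periods:
        if overtime_rate(regular, overtime) >= threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # a period below the threshold breaks the run
    return False
```

A site could feed this monthly totals per shift team and review any team for which it returns True.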

2. Critical Spares
Ensuring the ready availability of spare parts in the event of a loss of infrastructure redundancy or availability is
essential for mission-critical facilities such as data centers. These critical spare parts can either be stored onsite
or can be provided by vendors. When there is vendor dependency, due diligence should be applied to ensure the
availability of the required spare parts. This can be stipulated in the service level agreements (SLAs) of the
service and maintenance contracts held by the data center owner.

Developing a comprehensive critical spares inventory starts with a single point of failure analysis of the
data center design. Most high-quality data centers actually do not have high-impact single points of failure,

THE UPTIME INSTITUTE JOURNAL

so identifying which failures would reduce redundancy is what is really important. For example, if an
uninterruptible power supply (UPS) module unexpectedly goes into static bypass, there is typically no loss
of critical load; rather, the impact is a loss of redundancy of fully conditioned, battery-backed power. The
availability of a UPS critical spares kit will dramatically reduce the time spent at the reduced redundancy level.

An effective critical spares kit should include large circuit breakers or automatic/manual/static transfer
switches (ATS/MTS/STS). With breakers in particular, the need for a critical spare often manifests itself during
scheduled maintenance, such as triennial or 5-year maintenance and inspections that use primary
injection testing. This is particularly important in older facilities, where a large, expensive breaker or an ATS/
MTS/STS may be difficult to procure, if it can be located at all.

Finally, inventory management is essential to maintaining any critical spares kit. The inventory should include
a detailed asset list with robust controls for ensuring parts readiness (such as breaker certifications) and timely
replenishment when any part is taken from the inventory.
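
The inventory controls described above can be sketched as a simple readiness check. This is a minimal illustration, not a prescribed tool: the `SparePart` fields, the issue labels, and the idea of a per-part minimum stock level are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SparePart:
    asset_id: str
    description: str
    on_hand: int
    minimum_on_hand: int                          # hypothetical reorder point
    certification_expires: Optional[date] = None  # e.g. a breaker certification


def readiness_issues(parts, today):
    """Return (asset_id, issue) tuples for parts that need attention."""
    issues = []
    for part in parts:
        if part.on_hand < part.minimum_on_hand:
            issues.append((part.asset_id, "replenish"))
        if part.certification_expires is not None and part.certification_expires < today:
            issues.append((part.asset_id, "recertify"))
    return issues
```

Running such a check whenever a part is drawn from stock is one way to make replenishment timely rather than dependent on memory.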

3. Diesel Fuel
The availability and quality of diesel fuel are always a concern. Facility managers must consider these issues:

• Suppliers. For most data centers, formal contracts should be in place with at least three local suppliers.
These contracts must have well-defined SLAs to guarantee fuel delivery quantity and time minimums.

• Certificate of Conformance. Every fuel supplier should be required to maintain and comply with a Certificate
of Conformance to ensure any fuel delivered conforms to the ASTM D975 standard. Additional language
should be included that forbids any contamination with biofuels (see “Biofuels Putting Generators,
Availability at Risk,” p. 35).

• Fuel quality and polishing. Since most fuel must be stored on site for a very long time (even decades),
fuel quality must be maintained. Data center managers should regularly polish the fuel (multi-stage filtering
and circulation), test it, remove water, and manage additives. An independent laboratory that specializes in
diesel fuel should be retained to address phenomena such as stratification and self-polymerization, unless
this service is included in generator service and maintenance contracts. Fuel quality can be maintained by:

    • Using it

    • Polishing it with permanently installed systems

    • Having a vendor polish it with an on-site visit no less frequently than annually

    • Having a vendor remove the fuel and replace it with fresh fuel conforming to ASTM D975.

• Acceptance testing. Any fuel received on site should be sampled with tools such as a “bacon bomb,” taking
samples at several depths once the fuel truck has been parked for at least 15 minutes to let the fuel adequately
settle. Classic tests such as the visual beaker test (“bright and clear”) should be performed for any fuel
delivery. This scrutiny, along with regular lab samples sent to an independent laboratory and the fuel
vendors’ Certificates of Conformance, will help ensure contaminant-free and chemically correct
fuel is delivered.

• Correct fuel filter size. The simplest of errors, an incorrect filtration specification (micron size) for fuel filters,
has caused some of the most dramatic data center failures. After changing fuel filters, data center managers
should load bank the engine generators at 100% load to verify that the new filters do not cause fuel starvation.
This also validates the quality of the fuel.

4. Emergency Operating Procedures (EOPs)


EOPs should be developed for the ten most likely and high-impact abnormal conditions. These are pre-approved,
fully scripted responses for abnormal high-impact conditions that could reasonably occur.


Most modern data centers do not actually require a physical response to unexpected abnormal conditions. The real
purpose of the EOP is to verify the condition of the facility and correctly escalate and report it. The other essential
purpose of a well-developed EOP library is to ensure that facility operators do not try to be heroes, which often makes
matters worse and can endanger personnel. Training personnel to follow EOPs helps prevent the hero response.

Eight essential EOPs include:

• Loss of municipal power

• Loss of municipal water

• Activation of the fire alarm, including sustained level three detection, charged pipes, or a dry agent dump event

• Recovery from emergency power off (EPO) activation

• Loss of controls/PLC (programmable logic controller) or automation of either mechanical or electrical systems

• Loss of chilled water flow

• Generator failure to start

• UPS in static bypass

5. Drills
Having well-written EOPs means little if facility personnel are not familiar with them. The best way to maintain
a high level of operational readiness is to regularly simulate all the scenarios addressed by the site’s EOP library.
These simulations are usually referred to as site drills. The more realistic the site drills, the better. Site drills are
important refresher training that should be conducted in any data center.

In a live data center, there is usually very limited opportunity, if any, to replicate the actual infrastructure
conditions that warrant the use of EOPs. Many data center owners are uncomfortable at the notion of abruptly
disconnecting pumps, chillers, computer room air conditioners, and other equipment to trigger authentic
building management system (BMS) alarms and require the personnel to interpret them and exercise the
appropriate EOPs, with just a few exceptions such as scheduled pull-the-plug (PTP) tests.

Given this limitation, effective drilling requires the use of visual aids and props to safely simulate abnormal
conditions or behavior of real infrastructure. For example, a combination of printouts of BMS/emergency power
management system (EPMS) graphics, switchgear annunciators, and human-machine interface (HMI) screens, with
various signs and markings that can be taped to computer screens, panel boards, and equipment can help simulate
abnormal conditions that are anticipated by the site’s EOP library.

The operations team should drill using the actual procedures in use at the facility. This produces a detailed historical
document that accurately measures the performance of the drill. Any drill conducted should produce one or more
completely filled out EOPs for the scenario. These documents should be filed and retained as formal site training.

Scheduling and performing formal site drills must consider any scheduled maintenance activity, meaning it needs
full visibility to data center operations management and approval by the formal change management process and
policies established to control all activities in the data center facility environment.

6. A Procedure-Based Control Methodology


Any and all interaction with data center facility infrastructure should be done according to pre-approved,
detailed, and fully vetted procedures. These include:

• Methods of procedure (MOP). A detailed and scripted activity for formally scheduled and approved
preventive and corrective maintenance activities. MOPs ideally capture all details about the purpose of
the maintenance and everyone involved with it. A good MOP has very detailed steps to complete the activity,
including time stamps, initial blocks, and signature fields.

• Standard operating procedures (SOP). Any routine interaction that involves a basic change of state or
configuration of the infrastructure, often to support planned maintenance, should be controlled with a well-
written SOP. SOPs share many features of the MOP, such as time stamp and operator-annotated steps.

Many data centers require procedure libraries that include hundreds of documents. Such a large collection
of documents requires a formal policy that defines how these documents are written, reviewed, and formally
approved for use. These policies should also address revision and formatting processes and controls.

Finally, SOPs and MOPs are meaningless if they are not followed. Procedure deviation is a major cause of
incidents and outages. Experienced facility technicians can become cavalier and complacent, especially with the
repetition of large maintenance evolutions. Therefore, it is crucial that management enforce strict adherence to
the steps in all procedures and provide training to ensure the procedures are understood.

7. Safety Program
Any facility or portfolio must have a local authority having jurisdiction (AHJ/LAHJ) compliant safety program.
Having a current NFPA 70E compliant program is especially important for data centers (OSHA defers to NFPA
70E for electrical safety in the workplace). This includes a complete, fully tested personal protective equipment
(PPE) kit and associated lockout-tagout (LOTO) kit for hazardous energy isolation from both electrical and
mechanical sources. Fully formalized and AHJ/LAHJ compliant safety programs entail writing various program
definitions, policies, and procedures that explicitly define how safety is managed and administered for the facility.

8. Short Circuit Coordination Study (SCCS) and Arc-Flash Assessment


A facility must have a current SCCS and associated arc-flash hazard assessment with arc-flash stickers correctly
placed in all areas of the environment. All breakers must be verified to have trip unit settings set to those
recommended in the SCCS.

9. Battery Monitoring System


Analogous to fuel for engine generators, having a UPS means nothing if the batteries do not respond when the
UPS input voltage goes away or out of tolerance (loss of city power or severe power quality problems). Using
a battery monitoring system that gives real-time condition and predictive maintenance capabilities with
associated alarming is the best way to achieve full confidence in the UPS batteries. If no battery monitoring
system exists, then quarterly battery inspections should be performed with industry standard tools. This is
especially important for valve regulated lead-acid (VRLA) absorbent glass mat (AGM) batteries because the cells
usually fail open. One open cell in a 40-jar string renders the whole string useless.
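
The effect of a single open cell follows from the jars being wired in series: if any one jar opens, no current can flow through the string. The sketch below models that, along with a simple internal-resistance screen of the kind a monitoring system or quarterly inspection might apply. The function names and the 1.3x resistance limit are hypothetical; an actual limit should come from the battery manufacturer or the monitoring vendor.

```python
def string_is_available(jar_open_flags):
    """A series string can deliver current only if no jar has failed open."""
    return not any(jar_open_flags)


def jars_over_limit(resistances_mohm, baseline_mohm, limit_ratio=1.3):
    """Return indices of jars whose internal resistance exceeds the baseline by limit_ratio.

    limit_ratio=1.3 is an illustrative assumption, not a figure from the article.
    """
    limit = baseline_mohm * limit_ratio
    return [i for i, r in enumerate(resistances_mohm) if r > limit]
```

For a 40-jar string, marking any one of the 40 flags as open makes `string_is_available` return False, which is the behavior described above.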

Real-time data provided by contemporary battery monitoring systems not only validates the availability of the battery plant but also
allows very accurate measurement of its capacity and expected reasonable end-of-life replacement period, typically
extending VRLA-type battery retention by 25% or more. Such an extension of battery utilization amounts to a very
significant operational cost deferral given the multimillion-dollar value of many data center battery installations.

Battery spares should be purchased from the same batch as the battery installation and should be kept in
the same environmental and charging conditions as those connected to the UPS itself, so that the spares age
and degrade at the same rate as the batteries in use. In this way, when a battery develops unacceptably high
internal resistance and must be changed, the replacement battery has functional characteristics very similar or
nearly identical to those of the other batteries in the string. This ensures that no upset or imbalance in charging
voltage is applied to the other batteries in the string.

The battery monitoring system should ideally be extended to the spare UPS batteries and the batteries used for
starting engine generators. Using real-time, condition-based maintenance rather than the time-interval replacement
that is common produces confidence in these batteries. Using a battery monitoring system for such components
generates reliable expectations from them, results in their maximum utilization, and reduces maintenance.


Deploying a high-quality battery management system does not preclude physical inspections of the battery
plant, which should include visual checks on all battery connections and connector fastener torque checking. A
combination of a battery monitoring system and periodic physical inspection will ensure the maximum reliable
utilization of a data center’s battery plant.

10. Training
Training is a complicated topic that can cover many components and activities. The only formal training
curriculum in many data centers relates to corporate compliance (how to be a company employee), not actual
facilities activities or knowledge. This is because many facilities rely on informal on-the-job training (OJT).
While this approach can be effective, it means that achieving fully qualified staff depends on a large number
of undocumented quality variables, with the quality "fully qualified" being a largely subjective determination.
Informal OJT may also be deficient in key areas because it is a largely reactive approach.

At a minimum, a formalized training program and curriculum can be divided into two main categories: operational
readiness and planned activities. Formal training includes mastering the facility’s sequence of operations (SOO) for
electrical and mechanical systems and the integrated system SOO related to how all systems work together in concert.
This training often involves studying the alarms generated by the controls, BMS, and EPMS to respond correctly
to them, often leading to the use of an EOP for critical impact alarms. Studying a facility’s SOOs and the alarms the
monitoring systems generate can enable the staff to correctly respond to any abnormal facility condition.

Formal training related to planned activities should focus on things like access control, vendor escort and
supervision, and the use of procedures to conduct what are mostly preventive maintenance activities. Thus,
this training might include policy review, courses, and materials focused on the use of procedures, where the
approved procedures are located, how to write a procedure, the use of the change management system, the use of
the maintenance management system, the basis of the maintenance program, the navigation of the BMS/EPMS,
and other shift presence and site rounds requirements.

11. Maintenance
A high-quality maintenance program keeps equipment in like-new condition and maximizes its reliability,
performance, and lifespan. At a minimum, all major equipment assets should be maintained to original
equipment manufacturer (OEM) recommendations. Expanding maintenance considerations to include guidance from
ASHRAE, the International Electrical Testing Association (NETA), the National Electrical Manufacturers Association
(NEMA), the Institute of Electrical and Electronics Engineers (IEEE), the National Fire Protection Association
(NFPA), ASTM International, and the American National Standards Institute (ANSI), as well as design engineer
recommendations and authorized contractor recommendations, further enhances the maintenance standard of
the facility. Once the facility team is fully informed, service and maintenance contracts can be configured beyond
the conservative and sometimes excessive recommendations from the OEMs.

Maintenance should be performed at the minimum intervals needed to maintain good equipment condition that
minimizes abnormal behavior and maximizes the efficiency and life of the asset, typically monthly, quarterly,
semi-annually, or annually. Many times this interval can be less frequent than OEM recommendations, which
can be overly conservative.

Since scheduled maintenance usually involves some direct manipulation of equipment, facilities should be wary
of “maintenance-induced failure,” a phenomenon associated with unnecessary interactions with equipment that
increases the potential for human error and incidents. The minimum frequency of interaction with equipment
should be the level of interaction that captures its condition and keeps the asset in like-new condition. Any
greater frequency is excessive, offers no real benefit to the equipment, consumes personnel resources, and
increases risk of incidents.

In one case, a data center with 100 large air handling units (AHUs) determined that there was no real benefit
to performing monthly or quarterly preventive maintenance inspections, so those were removed from the
maintenance calendar and replaced by enhanced semi-annual inspections that still kept the equipment in like-
new condition but greatly reduced workload and unnecessary interaction with the equipment, allowing those
resources to be better applied elsewhere in the environment.


The industry currently follows several dominant maintenance methodologies, with most plans combining
traditional condition-based maintenance, run-to-fail, and predictive maintenance. Because of their sheer
size and high levels of redundancy and resiliency, some very large data centers may find it cost effective to let
some asset classes operate until they begin to show degraded performance, at which point maintenance can
be performed to restore the normal operating condition. Such approaches have to be carefully considered in
order to ensure risk is appropriately addressed. Ultimately the goals of any maintenance plan should be the
elimination of incidents due to abnormal equipment behavior or excessive interaction with the equipment using
the most cost-effective approach.

Deferred maintenance, or skipping of maintenance due to scheduling or resource issues, must be
aggressively avoided, especially when the deferral is a consequence of pushback against intrusive
or redundancy-reducing maintenance from the IT organization. Ultimately postponing important
maintenance can be counterproductive. Any deferred maintenance should be recorded, tracked, and
communicated to IT asset stakeholders to ensure it gets appropriate managerial visibility and resolution.

Predictive maintenance programs such as infrared scanning of power distribution systems, vibration analysis
of rotating assemblies, and lubrication oil analysis are powerful ways of getting advanced warning of potential
equipment degradation. Predictive maintenance can capture potential problems early, well before they begin to
impair the performance of critical equipment. The key to predictive maintenance is creating an equipment baseline
and then trending the data being collected in order to detect unusual rates of rise for degraded condition indicators.
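
The baseline-and-trend idea can be sketched as follows, for example with periodic infrared temperature readings of one connection. This is only an illustration under stated assumptions: readings are taken at equal intervals, the first few intervals define the baseline rate of rise, and a rise larger than three times that baseline is treated as unusual. The function name, the window size, and the factor of three are all hypothetical choices.

```python
from statistics import mean


def unusual_rises(readings, baseline_intervals=4, factor=3.0):
    """Return indices of readings that ended an unusually large rise.

    readings: equally spaced measurements, oldest first.
    The first baseline_intervals changes define the baseline rate of rise;
    later changes greater than factor times that baseline are flagged.
    """
    rises = [later - earlier for earlier, later in zip(readings, readings[1:])]
    if len(rises) <= baseline_intervals:
        return []  # not enough data beyond the baseline window
    baseline = mean(abs(r) for r in rises[:baseline_intervals])
    limit = factor * baseline
    return [i + 1 for i, r in enumerate(rises)
            if i >= baseline_intervals and r > limit]
```

With readings of 40.0, 40.5, 41.0, 41.5, 42.0, 42.5, and 46.0, only the final reading is flagged, because the jump of 3.5 exceeds three times the baseline rise of 0.5.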

A well-formulated maintenance program requires a maintenance management system, or MMS. An effective MMS
contains all the asset information and the scheduling, approval, and tracking information needed to complete all
recurring and corrective maintenance activities. The MMS can be flat-file or computer based, with the primary
benefit of a computerized MMS (CMMS) being resource tracking and administration (staff hours and work orders
completed on time) coupled with a relational database that can quickly access all aspects of scheduled and
recorded maintenance activities. Whether computerized or not, a key requirement for any MMS is the capture
and accessibility of maintenance history per asset. This facilitates the ability to clearly trend any maintenance per
asset as well as meet SLA compliance requirements and client due diligence information requests.

12. Access Control and Vendor Supervision


Only authorized personnel should be allowed into critical infrastructure areas; therefore, some access control
policy and some type of physical system must be in place to control traffic into the facility, with measures in
place to keep access lists current and enforced. Vendors must also be screened, qualified, and supervised based
on area and activity in the facility. The standard approach to vendors is complete supervision in addition to
formal compliance with the facility’s “house rules,” or policy documents, often referred to as critical facility or
data center house rules, which list and define allowed and non-allowed activities and what to do in the case of
abnormal situations or emergencies.

13. SOO, Integrated Systems Testing (IST), and Major Switchgear Validation
Most normal, steady-state automation is continuously verified in any running data center; however, the most
important automation is often merely assumed to work. Specifically, in the event of a loss of municipal power,
many data centers are stressed in a way that hasn’t happened since the facility was originally commissioned.
Coupled with a lack of preparation due to poor EOPs and failure to drill, loss of utility power can be a
make-or-break moment for a data center.

Maintenance programs often overlook the importance of preventive maintenance inspections of the
programmable logic controllers (PLCs) for the switchgear, which include checks of protective relays, power quality
meters, ATS/MTS/STS programming, firmware revisions, and the PLCs used in generator paralleling switchgear lineups.
Additionally, operator interaction with HMI and other high-level normal mode override functions can change the
original intended configuration of the automation settings over time.

Without a regular (at least annual) PTP test, neither the automation nor the switchgear itself is validated
to perform as expected. Many data centers are averse to the PTP test, with IT departments and customers
pushing back on any such testing with the mistaken idea that such testing is not needed and exposes them to
unneeded risk.


In addition to regular PTP tests, many routine checks of the PLC environment
should be conducted as regularly as any other scheduled maintenance of major infrastructure assets.

14. Change Management


A robust change management system should be put in place for any activity that crosses pre-established level of
risk (LOR) criteria. The change management system should include a formal review process based on a
well-defined LOR matrix that captures and ranks all activities that can occur at the data center. Basically, any activity
with real potential for impact on the data center must be formally scheduled and then approved by accountable
persons in the data center facilities and IT organizations before any such scheduled activities can occur.
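
One way to picture an LOR-based gate is the sketch below. The three risk levels, the sample activities and their rankings, and the threshold are hypothetical placeholders; a real matrix would be defined by the site. The requirement that both the facilities and IT organizations approve comes from the text above.

```python
from enum import IntEnum


class LOR(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical LOR matrix; real entries and rankings are site-specific.
LOR_MATRIX = {
    "site rounds": LOR.LOW,
    "filter change on redundant AHU": LOR.MEDIUM,
    "UPS maintenance bypass": LOR.HIGH,
}
CHANGE_THRESHOLD = LOR.MEDIUM  # assumed level at which formal approval applies


def requires_change_approval(activity):
    """Unlisted activities are treated as HIGH risk until they are ranked."""
    return LOR_MATRIX.get(activity, LOR.HIGH) >= CHANGE_THRESHOLD


def may_proceed(activity, scheduled, approvals):
    """Allow work only if it is below the threshold, or scheduled and approved by both groups."""
    if not requires_change_approval(activity):
        return True
    return bool(scheduled) and {"facilities", "IT"} <= set(approvals)
```

Treating unranked activities as high risk is a deliberately conservative default for this example, so that new kinds of work cannot bypass review simply because they are missing from the matrix.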

Stephen Burgess is a consultant with Uptime Institute Professional Services.
Mr. Burgess reviews and assesses data center designs, facilities, and
operations leading to Tier Certification of Constructed Facility, Tier
Certification of Operational Facility, and the M&O Stamp of Approval, and
teaches the Accredited Tier Specialist course.
