Abstract

Cooling failure in data centers (DCs) is a complex phenomenon because of the many interactions between the cooling infrastructure and the information technology (IT) equipment. To fully understand it, a system integration philosophy is vital to the testing and the design of experiments. In this paper, a facility-level DC cooling failure experiment is run and analyzed. An airside cooling failure is introduced to the facility at two different cooling set points and in both open and contained aisle configurations. Quantitative instrumentation includes pressure differentials, tile airflow, external contour and discrete air inlet temperatures, intelligent platform management interface (IPMI) data, and cooling system data during failure recovery. Qualitative measurements include infrared imaging and airflow visualization via smoke tracing. To the best of our knowledge, this is the first experimental study in the literature in which an actual multi-aisle facility cooling failure is run with a real IT (compute, network, and storage) load in the white space. This establishes a link between variations at the facility level and their effects at the central processing unit (CPU) level. The results show that, based on the external IT inlet temperature sensors, the containment configuration exhibits a longer available uptime (AU) during failure. However, the IPMI data show the opposite: the available uptime is significantly shorter when assessed from the internal IT analytics than from the external sensors. IT power, CPU temperature, and fan speed all reach higher values during the containment failure. This occurs because external impedances form instantaneously in the containment during failure, which renders the contained aisle less resilient than the open aisle. The tradeoffs between power usage effectiveness (PUE), operating expenditure (OPEX), and AU are also explained.
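PUE is conventionally defined as the ratio of total facility power to IT equipment power. The following minimal Python sketch illustrates, with purely hypothetical power readings (the values and variable names are assumptions, not data from this study), the kind of PUE comparison that underlies the tradeoff discussed above: a higher cooling set point reduces cooling power and improves PUE, but may shorten AU during a cooling failure.

```python
# Illustrative PUE calculation with hypothetical readings (not data from the study).
def pue(total_facility_power_kw: float, it_power_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    return total_facility_power_kw / it_power_kw

# Hypothetical example: the same IT load at two cooling set points.
print(pue(total_facility_power_kw=1500.0, it_power_kw=1000.0))  # 1.50 (lower set point)
print(pue(total_facility_power_kw=1350.0, it_power_kw=1000.0))  # 1.35 (higher set point)
```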
