An analysis of GPU utilization trends on the Keeneland initial delivery system

Tabitha K Samuel, Stephen McNally, John Wynkoop

doi:10.1145/2335755.2335793

Abstract

In late 2010, the Georgia Institute of Technology, along with its partners (Oak Ridge National Laboratory, the University of Tennessee-Knoxville, and the National Institute for Computational Sciences), deployed the Keeneland Initial Delivery System (KIDS), a 201-teraflop, 120-node HP SL390 system with 240 Intel Xeon CPUs and 360 NVIDIA Fermi graphics processors, as part of the Keeneland Project. The Keeneland Project is a five-year Track 2D cooperative agreement awarded by the National Science Foundation (NSF) in 2009 for the deployment of an innovative high performance computing system that brings emerging architectures to the open science community. KIDS is being used to develop programming tools and libraries to ensure that the project can productively accelerate important scientific and engineering applications.

Until late 2011, there was no formal mechanism in place for quantifying the efficiency of GPU usage on the Keeneland system because most applications did not have the appropriate administrative tools and vendor support. GPU administration has largely been an afterthought, as vendors in this space are focused on gaming and video applications. There is a compelling need to monitor GPU utilization on Keeneland for the purposes of proper system administration and future planning for the Keeneland Final System, which is expected to be in production in July 2012.

With the release of CUDA 4.1, NVIDIA added enhanced functionality to the NVIDIA System Management Interface (nvidia-smi) tool, a management and monitoring command line utility that leverages the NVIDIA Management Library (NVML). NVML is a C-based API for monitoring and managing various states of NVIDIA GPU devices. It provides direct access to the queries and commands exposed via nvidia-smi. Using nvidia-smi, a monitoring tool was built for KIDS to monitor utilization and memory usage on the GPUs.

In this paper, we discuss the development of the GPU Utilization tool in depth, along with its implementation details on KIDS. We also provide an analysis of the utilization statistics generated by this tool. For example, we identify utilization trends across jobs submitted on KIDS, such as overall GPU utilization as compared to CPU utilization. We also examine GPU utilization from the perspective of software: which packages are most frequently used, and how they compare with respect to GPU utilization and memory usage.

Collection and analysis of this data is essential for facilitating heterogeneous computing on the Keeneland Initial Delivery System. Future uses of these statistics include providing insight into overall usage of the system, determining appropriate ratios for jobs (CPU to GPU, GPU to host memory), assisting in scheduling policy management, and determining software utilization.

These statistics become even more relevant as the center prepares for the deployment of the Keeneland Final System. As heterogeneous computing becomes more common and moves toward being the standard, this information will help in delivering consistently high uptime and assist software developers in writing more efficient code for the majority of the codebases aimed at heterogeneous systems.
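The kind of per-GPU sampling described above can be illustrated with a minimal sketch. It is not the tool built for KIDS, whose implementation is not given in this abstract. It assumes a driver recent enough to support the `--query-gpu` CSV interface of nvidia-smi, which may postdate the CUDA 4.1 release discussed here. The names `GpuSample`, `parse_samples`, `mean_utilization`, and `poll_once` are hypothetical and chosen only for illustration.

```python
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi; assumes a driver that supports --query-gpu.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    """One utilization/memory reading for a single GPU (hypothetical record type)."""
    index: int
    utilization_pct: float
    memory_used_mib: float
    memory_total_mib: float

    @property
    def memory_used_fraction(self) -> float:
        # Guard against a zero total so a malformed reading cannot divide by zero.
        if self.memory_total_mib == 0:
            return 0.0
        return self.memory_used_mib / self.memory_total_mib


def parse_samples(csv_text: str) -> list:
    """Parse 'csv,noheader,nounits' output; rows that cannot be parsed are skipped."""
    samples = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            samples.append(
                GpuSample(int(fields[0]), float(fields[1]),
                          float(fields[2]), float(fields[3]))
            )
        except ValueError:
            # Non-numeric placeholders such as "[Not Supported]" land here.
            continue
    return samples


def mean_utilization(samples: list) -> float:
    """Average GPU utilization (percent) across the parsed samples."""
    if not samples:
        return 0.0
    return sum(s.utilization_pct for s in samples) / len(samples)


def poll_once() -> list:
    """Run nvidia-smi once on the local node and return the parsed samples."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_samples(result.stdout)
```

On a GPU node, calling `poll_once()` at a fixed interval and recording the results alongside the job identifier would give the raw material for per-job utilization and memory statistics. The sampling interval and the storage backend are left open here because the abstract does not specify them.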

Full Text