Abstract

Training scientific deep learning models requires the substantial compute power of high-performance computing systems. In this paper, we analyze the performance characteristics of the benchmarks from the exploratory research project CANDLE (Cancer Distributed Learning Environment), focusing on three hyperparameters: the number of epochs, the batch size, and the learning rate. We present a parallel methodology that uses the distributed deep learning framework Horovod to parallelize the CANDLE benchmarks. We then apply scaling strategies for both the number of epochs and the batch size, combined with linear learning rate scaling, to investigate how they affect execution time and accuracy, as well as power, energy, and scalability of the parallel CANDLE benchmarks under strong and weak scaling on the IBM Power9 heterogeneous system Summit at Oak Ridge National Laboratory and the Cray XC40 Theta at Argonne National Laboratory. This study provides insights into how to choose appropriate numbers of epochs, batch sizes, and compute resources for these benchmarks so as to preserve high accuracy while reducing execution time. We identify a data-loading performance bottleneck and address it to improve performance and energy efficiency for better scalability. Results with the modified benchmarks on Summit show up to 78.25% performance improvement and up to 78% energy saving under strong scaling on up to 384 GPUs, and up to 79.5% performance improvement and up to 77.11% energy saving under weak scaling on up to 3,072 GPUs. On Theta, we achieve up to 45.22% performance improvement and up to 41.78% energy saving under strong scaling on up to 384 nodes. Moreover, the modification dramatically reduces the broadcast overhead.
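To make the scaling strategy concrete, the sketch below illustrates the linear learning-rate scaling rule used with Horovod-style data parallelism: under weak scaling, each worker keeps its local batch size, so the effective global batch grows linearly with the number of workers, and the single-worker learning rate is multiplied by the same factor. This is a minimal, self-contained illustration; the function and variable names (`base_lr`, `per_worker_batch`, `workers`) are illustrative and not taken from the paper, and the Horovod calls shown in comments are indicative of typical usage rather than the authors' exact code.

```python
def scaled_learning_rate(base_lr: float, workers: int) -> float:
    """Linear scaling rule: multiply the single-worker learning rate
    by the number of data-parallel workers, matching the linear
    growth of the global batch size."""
    return base_lr * workers

def global_batch_size(per_worker_batch: int, workers: int) -> int:
    """Under weak scaling, each worker keeps its local batch size,
    so the global batch grows linearly with the worker count."""
    return per_worker_batch * workers

if __name__ == "__main__":
    # Illustrative values only; the CANDLE benchmarks define their own defaults.
    base_lr, local_batch = 0.001, 32
    for w in (1, 6, 384):  # e.g. 1 GPU, one 6-GPU Summit node, 64 nodes
        print(f"workers={w:4d}  global_batch={global_batch_size(local_batch, w):6d}"
              f"  lr={scaled_learning_rate(base_lr, w):.4f}")
    # In a real Horovod run, the worker count comes from hvd.size() after
    # hvd.init(), and the scaled rate is passed to the optimizer wrapped
    # by hvd.DistributedOptimizer; rank 0's initial weights are broadcast
    # to all workers before training begins.
```

Under strong scaling, by contrast, the global batch (and hence the learning rate) is typically held fixed while the per-worker batch shrinks as workers are added.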
