BARM: A Batch-Aware Resource Manager for Boosting Multiple Neural Networks Inference on GPUs With Memory Oversubscription

Zhao-Wei Qiu,Ya-Shu Chen,Kun-Sheng Liu

doi:10.1109/tpds.2022.3199806

Abstract

Modern intelligent devices usually execute multiple neural networks to improve service quality. However, system performance degrades significantly when the working set exceeds the physical memory capability, a phenomenon called memory oversubscription. To support the execution of multiple independent neural networks with limited physical memory, this article explores resource management in GPUs with unified virtual memory and demand paging. We first analyze the relationship between the simultaneous execution of multiple neural networks from streaming multiprocessors (SM) assignment and page fault overhead from memory thrashing. To boost performance by reducing the page fault penalty, we propose a batch-aware resource management approach, BARM, including (1) batch-aware SM resource allocation to increase the batch size and (2) thrashing-preventing memory allocation to eliminate run-time thrashing. The performance of the proposed method was evaluated using a series of workloads, and response latency is reduced significantly over the state-of-the-art page fault prefetcher and batch-aware TLP management. The proposed framework was also implemented on the real platform and evaluated by a case study, and impressive results were obtained.

Full Text