Abstract
Modern intelligent devices usually execute multiple neural networks to improve service quality. However, system performance degrades significantly when the working set exceeds the physical memory capability, a phenomenon called memory oversubscription. To support the execution of multiple independent neural networks with limited physical memory, this article explores resource management in GPUs with unified virtual memory and demand paging. We first analyze the relationship between the simultaneous execution of multiple neural networks from streaming multiprocessors (SM) assignment and page fault overhead from memory thrashing. To boost performance by reducing the page fault penalty, we propose a batch-aware resource management approach, BARM, including (1) batch-aware SM resource allocation to increase the batch size and (2) thrashing-preventing memory allocation to eliminate run-time thrashing. The performance of the proposed method was evaluated using a series of workloads, and response latency is reduced significantly over the state-of-the-art page fault prefetcher and batch-aware TLP management. The proposed framework was also implemented on the real platform and evaluated by a case study, and impressive results were obtained.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE Transactions on Parallel and Distributed Systems
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.