Abstract
As recent heterogeneous systems comprise multi-core CPUs and multiple GPUs, efficient allocation of multiple data-parallel applications has become a primary goal for achieving both maximum total performance and efficiency. However, efficiently orchestrating multiple applications is highly challenging because detailed runtime status, such as the expected remaining time and available memory size of each computing device, is hidden. To solve these problems, we propose a dynamic data-parallel application allocation framework called ADAMS. Evaluations show that our framework improves the average total device execution time by 1.85× over the round-robin policy in a non-shared-memory system with a small data set.
Highlights
High performance and energy efficiency are critical parameters for emerging applications such as vision and various machine-learning applications [1,2,3,4,5]
Though execution-time estimation is nearly impossible for general programs, recent studies have shown that the execution time of general-purpose GPU (GPGPU) tasks is fairly predictable from input problem sizes [9,10,11]; we therefore use a problem-size-based regression model for execution-time prediction, similar to the approach of MKMD [10], built on offline profile data
We propose ADAMS, a dynamic multiple data-parallel application allocation framework, to efficiently allocate multiple processes to multiple devices
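The problem-size-based regression mentioned in the highlights can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the sample profile data, the linear functional form, and the function names are assumptions.

```python
import numpy as np

# Hypothetical offline profile data: (input problem size, measured kernel
# time in ms). These values are illustrative, not taken from the paper.
sizes = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6])
times = np.array([1.2, 2.1, 4.3, 8.2, 16.5])

# Fit a simple linear model: time ≈ a * size + b. An MKMD-style model
# may use a richer functional form; a first-degree fit keeps the sketch small.
a, b = np.polyfit(sizes, times, 1)

def predict_time(problem_size):
    """Predict kernel execution time (ms) for a given problem size."""
    return a * problem_size + b
```

At scheduling time, such a predictor lets the framework estimate each device's remaining work without querying hidden runtime state.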
Summary
High performance and energy efficiency are critical parameters for emerging applications such as vision and various machine-learning applications [1,2,3,4,5]. In this situation, load-balancing failures among multiple devices leave some devices underused, and the maximum performance of the system cannot be achieved. Recent GPU evolution trends (NVIDIA Pascal [16] and Volta [17]) improve both throughput and latency by allowing concurrent execution of multiple kernels through preemptive [18] and spatial [19] multitasking. Efficient memory management has become crucial for achieving better multitasking performance because concurrent kernel execution requires more memory to handle all co-running kernels. To address this issue, we introduce an automatic device allocation management system (ADAMS). Its shared memory holds the allocated application list and the remaining total execution time of each device
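The shared-memory bookkeeping described above can be sketched as a per-device table. This is a sketch under stated assumptions: the field names, the lock-based synchronization, and the least-remaining-time placement rule are illustrative choices, not the paper's confirmed design.

```python
import threading

class DeviceTable:
    """Shared per-device state: allocated applications and remaining work."""

    def __init__(self, device_ids):
        self.lock = threading.Lock()
        # Allocated application list per device (as in the summary).
        self.apps = {d: [] for d in device_ids}
        # Remaining total execution time per device, in ms.
        self.remaining = {d: 0.0 for d in device_ids}

    def allocate(self, app_id, predicted_ms):
        """Place an app on the device with the least remaining work."""
        with self.lock:
            dev = min(self.remaining, key=self.remaining.get)
            self.apps[dev].append(app_id)
            self.remaining[dev] += predicted_ms
            return dev

    def complete(self, dev, app_id, predicted_ms):
        """Remove a finished app and credit back its predicted time."""
        with self.lock:
            self.apps[dev].remove(app_id)
            self.remaining[dev] -= predicted_ms
```

For example, with two idle GPUs, successive allocations alternate between them until their predicted remaining times diverge.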