Abstract

As recent heterogeneous systems comprise multi-core CPUs and multiple GPUs, efficiently allocating multiple data-parallel applications has become a primary goal for achieving both maximum total performance and efficiency. However, efficient orchestration of multiple applications is highly challenging because detailed runtime status, such as the expected remaining time and available memory size of each computing device, is hidden. To solve these problems, we propose a dynamic data-parallel application allocation framework called ADAMS. Evaluations show that our framework improves the average total device execution time by 1.85× over a round-robin policy on a non-shared-memory system with a small data set.

Highlights

  • High performance and energy efficiency are critical parameters for emerging applications such as vision and various machine-learning applications [1,2,3,4,5]

  • Though execution time estimation is nearly impossible for general programs, recent studies have shown that the execution time of general-purpose GPU (GPGPU) tasks is fairly predictable from input problem sizes [9,10,11]; we therefore use a problem-size-based regression model for execution time prediction built from offline profile data, similar to the approach of MKMD [10]

  • We propose a dynamic multiple data-parallel application allocation framework (ADAMS) that efficiently allocates multiple processes to multiple devices
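The problem-size-based regression mentioned above can be sketched as a simple least-squares fit over offline profile data. The profile numbers, the linear model form, and the function names below are illustrative assumptions, not details taken from the paper.

```python
# Sketch: problem-size-based execution-time regression from offline profiles,
# in the spirit of the MKMD-style approach. All numbers are made up.
import numpy as np

# Hypothetical offline profile: (input problem size, measured kernel time in ms).
profile_sizes = np.array([1e4, 5e4, 1e5, 5e5, 1e6])
profile_times = np.array([0.8, 3.9, 7.7, 38.2, 76.5])

# Fit time ≈ a * size + b with ordinary least squares.
a, b = np.polyfit(profile_sizes, profile_times, deg=1)

def predict_time_ms(problem_size: float) -> float:
    """Predict kernel execution time (ms) for an unseen problem size."""
    return a * problem_size + b
```

A real predictor might use per-kernel models or higher-order terms; the point is only that predicted times can be derived from problem size before launch.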


Summary

Introduction

High performance and energy efficiency are critical parameters for emerging applications such as vision and various machine-learning applications [1,2,3,4,5]. In this situation, load-balancing failures among multiple devices leave some devices underused, and the maximum performance of the system cannot be achieved. Recent GPU evolution trends (NVIDIA Pascal [16] and Volta [17]) improve both throughput and latency by allowing concurrent execution of multiple kernels through preemptive [18] or spatial [19] multitasking. Efficient memory management has become crucial for achieving better multitasking performance because concurrent kernel execution requires more memory to handle all co-running kernels. To address this issue, we introduce an automatic device allocation management system (ADAMS). Its shared memory holds the allocated application list and the remaining total execution time of each device.
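The per-device bookkeeping described above (an allocated application list plus remaining total execution time, checked against available memory) can be sketched as follows. The data-structure fields and the selection rule are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of remaining-time-based device selection with a memory check.
# Field names and policy details are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    free_mem_mb: int
    remaining_ms: float = 0.0              # sum of predicted times of queued apps
    app_list: list = field(default_factory=list)

def allocate(app_name: str, predicted_ms: float, mem_mb: int, devices):
    """Pick the device with the least remaining work whose memory fits the app."""
    candidates = [d for d in devices if d.free_mem_mb >= mem_mb]
    if not candidates:
        return None                        # no device can hold the working set
    best = min(candidates, key=lambda d: d.remaining_ms)
    best.app_list.append(app_name)
    best.remaining_ms += predicted_ms
    best.free_mem_mb -= mem_mb
    return best
```

Compared with round-robin, this kind of policy only dispatches to a device that both has the memory headroom and the shortest predicted backlog.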

OpenCL Programming Model
Target Device-Selection Challenge
Execution Time Prediction of Data-Parallel Applications
Limitations in Non-Shared-Memory Systems
Limitations in Shared-Memory Systems
Overview
Allocation Manager
Global Memory Analyzer
Concurrent Time Estimator
Time Prediction
Evaluation and Discussion
Allocation Policy on Non-Shared-Memory Systems
Memory Consideration on Non-Shared-Memory Systems
Memory Consideration on Shared-Memory Systems
Case Study
Overhead
Findings
Related Work
Conclusions