Abstract

The unrivaled computing capability of modern GPUs meets the demand for processing massive amounts of data in many application domains. While traditional HPC systems run applications as standalone entities that occupy entire GPUs, GPU-based DBMSs are designed to run multiple tasks concurrently on the same device. System-level resource management mechanisms are therefore needed to fully unleash the computing power of GPUs for large-scale data processing, and prior research has addressed this need. In our previous work, we modeled single compute-bound kernels on GPUs under NVIDIA’s CUDA framework and provided an in-depth anatomy of NVIDIA’s concurrent kernel execution mechanism (CUDA streams). This paper focuses on resource allocation among multiple GPU applications with the goal of optimizing system throughput. Compared with earlier studies that enable concurrent task support on GPUs, such as MultiQx-GPU, we take a different approach: we control the launch parameters of multiple GPU kernels, derived from compile-time performance modeling, as a kernel-level optimization, and add a more general pre-processing model with batch-level control to further enhance performance. Specifically, we construct a variant of the multi-dimensional knapsack model to maximize concurrency in a multi-kernel environment. We present an in-depth analysis of the model and develop a dynamic programming algorithm to solve it. We prove that the algorithm finds optimal solutions (in terms of thread concurrency) and has pseudopolynomial time and space complexity. These results are verified by extensive experiments on our microbenchmark, which consists of real-world GPU queries. Furthermore, the solutions identified by our method also significantly reduce the total running time of the workload compared with sequential execution and MultiQx-GPU.
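The multi-dimensional knapsack formulation and its dynamic programming solution can be illustrated with a simplified sketch. This is not the paper's actual model; the resource dimensions (registers, shared memory), configuration tuples, and capacities below are hypothetical placeholders. Each kernel contributes a group of candidate launch configurations, at most one of which is selected, and the objective is total concurrent threads:

```python
def max_concurrency(kernels, capacity):
    """Group knapsack over two illustrative resource dimensions.

    kernels:  list of lists of (threads, regs, smem) tuples; each inner
              list holds the candidate configurations of one kernel.
    capacity: (max_regs, max_smem) device-wide limits (made-up units).
    Returns the maximum total thread concurrency achievable.
    """
    R, S = capacity
    # dp[r][s] = best thread count using at most r registers and s shared memory
    dp = [[0] * (S + 1) for _ in range(R + 1)]
    for configs in kernels:                 # one choice group per kernel
        nxt = [row[:] for row in dp]        # pick at most one config per kernel
        for threads, regs, smem in configs:
            for r in range(regs, R + 1):
                for s in range(smem, S + 1):
                    cand = dp[r - regs][s - smem] + threads
                    if cand > nxt[r][s]:
                        nxt[r][s] = cand
        dp = nxt
    return dp[R][S]
```

The table has one cell per combination of resource amounts, so time and space grow with the product of the capacities rather than with the input length alone, which matches the pseudopolynomial complexity claimed in the abstract.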

Highlights

  • With the recent development of semiconductor technology, the number of processing units integrated on a chip increases rapidly, resulting in massively parallel computing capability

  • In our previous work [6], we proposed a GPGPU-based Scientific Data Management System (G-SDMS) that uses Compute Unified Device Architecture (CUDA)-supported GPUs as the platform for query processing in a push-based manner

  • We present a general scheme in optimizing the concurrency and overall performance of heterogeneous tasks under the Compute Unified Device Architecture (CUDA) environment [19]

Summary

Objectives

We aim to transform the model into a form that is easier to handle by considering the actual environment in which our problem is defined. The objective of this study is to allocate resources to concurrent CUDA kernels by configuring their runtime parameters so as to maximize system performance.
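To make the idea of "configuring runtime parameters" concrete, the following sketch estimates how many thread blocks of a given launch configuration could be resident on one streaming multiprocessor at once. All per-SM limits here are illustrative defaults, not the specifications of any real device, and this is a simplification of the modeling the study performs:

```python
def resident_blocks(threads_per_block, smem_per_block, regs_per_thread,
                    sm_threads=2048, sm_smem=49152, sm_regs=65536):
    """Return how many blocks fit on one SM under each resource limit.

    The binding constraint is whichever resource (threads, shared
    memory, or registers) runs out first; limits are placeholders.
    """
    by_threads = sm_threads // threads_per_block
    by_smem = sm_smem // smem_per_block if smem_per_block else by_threads
    by_regs = sm_regs // (regs_per_thread * threads_per_block)
    return min(by_threads, by_smem, by_regs)
```

Shrinking a kernel's block size or shared-memory footprint raises this bound, freeing room for blocks of other kernels; that is the lever the resource-allocation model turns.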

