Concurrency Using CUDA Streams and Events
Chapter 7 explores the ability of GPUs to perform multiple tasks simultaneously, including overlapping IO with computation and the simultaneous running of multiple kernels. CUDA streams and events are advanced features that allow users to manage multiple asynchronous tasks running on the GPU. Examples are given, and the NVIDIA Visual Profiler (NVVP) is used to visualise the timeline of tasks in multiple CUDA streams. Asynchronous disk IO on the host PC can also be performed, and examples using the C++ <thread> library are given. Finally, the new CUDA graphs feature is introduced; it provides a wrapper for efficiently launching large numbers of kernel calls in complex workloads.
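As a minimal sketch of the stream and event machinery described above (an illustration, not one of the chapter's own listings), the fragment below splits a workload across two streams so that one chunk's host-to-device copy overlaps the other chunk's kernel, and uses events to time the whole pipeline; the `scale` kernel and the buffer sizes are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: scales each element in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int N = 1 << 22, half = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned memory: needed for truly async copies
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaEvent_t start, stop;
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < 2; ++k) {
        float *hp = h + k * half, *dp = d + k * half;
        cudaMemcpyAsync(dp, hp, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(dp, half, 2.0f); // overlaps the other stream's copy
        cudaMemcpyAsync(hp, dp, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("two-stream pipeline: %.3f ms, h[0]=%.1f\n", ms, h[0]);

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

The pinned allocation from cudaMallocHost is what lets cudaMemcpyAsync overlap with kernel execution; with ordinary pageable memory the copies may not overlap at all.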
- Conference Article
- 10.1109/asap.2019.00014
- Jul 1, 2019
Base64 encoding has many applications on the Web. Previous studies investigated the optimizations of Base64 encoding algorithm on central processing units (CPUs). In this paper, we describe the optimizations of the algorithm on heterogeneous computing platforms. More specifically, we explain the algorithm, convert the algorithm to kernels written in CUDA C/C++ and Open Computing Language (OpenCL), optimize the CUDA and OpenCL applications with CUDA and OpenCL streams which can overlap data transfers with kernel computations, and vectorize the CUDA and OpenCL kernels to improve kernel throughput. We evaluate the impact of the number of streams upon the kernel performance on an NVIDIA Pascal P100 graphics processing unit (GPU) and a Nallatech 385A card that features an Intel Arria 10 GX1150 field-programmable gate array (FPGA). We also measure the performance and power of the applications on the CPU, GPU, and FPGA to assess the advantage of each platform and the benefit of kernel offloading. The experiments show that using vector data types in the kernels does not by itself improve performance, and that launching more work-items is better than processing larger vectors per work-item on the GPU. OpenCL and CUDA streams can achieve almost the same performance on the GPU, but streams should be used with caution when GPU resources are underutilized. On the FPGA, kernel vectorization using 16 vector lanes can achieve the highest performance when the number of streams is one. However, increasing the vector width per work-item and the number of streams can decrease the kernel computation time for each stream, and thereby reduce the number of concurrent operations across the streams. While the raw performance on the GPU is 3.1X higher than that on the FPGA, the FPGA consumes 3.4X less power. A comparison with a state-of-the-art implementation on an Intel CPU server shows an increasing benefit of kernel offloading.
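As an illustration of the vector data type question examined here (a hedged sketch, not the paper's kernels), the following compares a scalar kernel that loads one byte per thread with a vectorized kernel that loads a `uchar4`; `transform` is a stand-in for the real Base64 mapping, which actually expands 3 input bytes into 4 output characters.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Stand-in for a per-byte transformation (real Base64 maps 3 input bytes to 4 output chars).
__device__ __forceinline__ uint8_t transform(uint8_t b) { return b + 1; }

// Scalar version: one byte per thread.
__global__ void map_scalar(const uint8_t *in, uint8_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = transform(in[i]);
}

// Vectorized version: one uchar4 (four bytes) per thread; n must be a multiple of 4.
__global__ void map_vec4(const uchar4 *in, uchar4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 v = in[i];                       // single 32-bit load
        v.x = transform(v.x); v.y = transform(v.y);
        v.z = transform(v.z); v.w = transform(v.w);
        out[i] = v;                             // single 32-bit store
    }
}

int main() {
    const int n = 1 << 20;
    uint8_t *din, *dout;
    cudaMalloc(&din, n); cudaMalloc(&dout, n);
    cudaMemset(din, 'A', n);
    map_scalar<<<(n + 255) / 256, 256>>>(din, dout, n);
    map_vec4<<<(n / 4 + 255) / 256, 256>>>((const uchar4 *)din, (uchar4 *)dout, n / 4);
    cudaDeviceSynchronize();
    cudaFree(din); cudaFree(dout);
    return 0;
}
```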
- Conference Article
- 10.1109/icpads.2009.115
- Jan 1, 2009
NVIDIA CUDA and ATI Stream are the two major general-purpose GPU (GPGPU) computing technologies. We implemented RankBoost, a web relevance ranking algorithm, on both NVIDIA CUDA and ATI Stream platforms to accelerate the algorithm and illustrate the differences between these two technologies. Our results show that the performance of GPU programs is highly dependent on the utilization of the GPU's hardware memory architectural features. In this work, we accelerated the RankBoost algorithm on both platforms, achieving a 22.9X speedup on CUDA and a 9.2X speedup on ATI Stream. We then compared the differences in memory architecture between NVIDIA CUDA and ATI Stream.
- Conference Article
- 10.1109/aero.2017.7943882
- Mar 1, 2017
Low-power, high-performance, System-on-Chip (SoC) devices, such as the NVIDIA Tegra K1 and Tegra X1, have many potential uses in aerospace applications. Fusing ARM CPUs and a large GPU, Tegra SoCs are well suited for image and signal processing. However, fault masking and tolerance on GPUs is relatively unexplored for harsh environments. With hundreds of GPU cores, a complex caching structure, and a custom task scheduler, Tegra SoCs are vulnerable to a wide range of single-event upsets (SEUs). Triple-modular redundancy (TMR) provides a strong basis for fault masking on a wide range of devices. GPUs pose a unique challenge to a typical TMR implementation. NVIDIA's scheduler assigns tasks based on available resources, but the scheduling process is not publicly documented. As a result, a malfunctioning core could be assigned the same block of code in each TMR module. In this case, a fault could go undetected, impacting the resulting data with an error. Likewise, an upset in the scheduler or cache could have an adverse impact on data integrity. In order to mask and mitigate upsets in GPUs, we propose and investigate a new method that features persistent threading and CUDA Streams with TMR. A persistent thread is a new approach to GPU programming where a kernel's threads run indefinitely. CUDA Streams enable multiple kernels to run concurrently on a single GPU. Combining these two programming paradigms, we remove the vulnerability of scheduler faults, and ensure that each iteration is executed concurrently on different cores, with each instance having its own copy of the data. We evaluate our method with an experiment that uses a Sobel filter applied to a 640×480 image on an NVIDIA Tegra X1. In order to inject faults to verify our method, a separate task corrupts a memory location. Using this simple injector, we are able to simulate an upset in a GPU core or memory location. From this experiment, our results confirm that using persistent threading and CUDA Streams with TMR masks the simulated SEUs on the Tegra X1. Furthermore, we provide performance results to quantify the overhead with this new method.
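A hedged sketch of the streams-plus-TMR idea follows (persistent threading and the Sobel filter are omitted, and `work` is a placeholder for the protected computation): each redundant copy runs in its own stream with its own output buffer, and a voter kernel reconciles the three results assuming at most one faulty copy.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the protected computation (the paper uses a Sobel filter).
__global__ void work(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

// 2-of-3 vote: if a and b agree, take a; otherwise c must match the correct copy
// (assumes at most one faulty instance).
__global__ void vote(const float *a, const float *b, const float *c, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (a[i] == b[i]) ? a[i] : c[i];
}

int main() {
    const int N = 1 << 20;
    float *in, *r[3], *res;
    cudaMalloc(&in, N * sizeof(float));
    cudaMalloc(&res, N * sizeof(float));
    cudaMemset(in, 0, N * sizeof(float));
    cudaStream_t s[3];
    for (int k = 0; k < 3; ++k) {
        cudaMalloc(&r[k], N * sizeof(float));
        cudaStreamCreate(&s[k]);
        // Each redundant instance runs in its own stream with its own output copy.
        work<<<(N + 255) / 256, 256, 0, s[k]>>>(in, r[k], N);
    }
    for (int k = 0; k < 3; ++k) cudaStreamSynchronize(s[k]);
    vote<<<(N + 255) / 256, 256>>>(r[0], r[1], r[2], res, N);
    cudaDeviceSynchronize();
    printf("TMR sketch done: %s\n", cudaGetErrorString(cudaGetLastError()));
    for (int k = 0; k < 3; ++k) { cudaFree(r[k]); cudaStreamDestroy(s[k]); }
    cudaFree(in); cudaFree(res);
    return 0;
}
```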
- Conference Article
- 10.1109/hpcs.2018.00091
- Jul 1, 2018
In this paper we evaluate several approaches to performing simultaneous matrix-vector multiplication of large numbers of matrices on a GPU accelerator. The goal of this evaluation is to develop efficient techniques for massively parallel Hybrid Total FETI solvers in our ESPRESO library. FETI solvers generally use sparse matrices. To overcome this, we previously proposed the Local Schur Complement method for FETI, which converts sparse matrices to their dense representation without significantly increasing the memory requirements of the GPU accelerator. We selected the following techniques: standard GEMV, CUDA streams, dynamic parallelism, batched GEMM, BSR GEMV and HYB GEMV. Our results show that (i) if a FETI solver contains a large number of small matrices, i.e. there is a large number of small subdomains, then the best approach is dynamic parallelism; (ii) if there is a small number of large subdomains, then the optimal approaches are dynamic parallelism and CUDA streams. Note that the Local Schur Complement method in conjunction with Hybrid Total FETI performs better with smaller subdomains.
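A minimal sketch of the CUDA-streams variant evaluated here (not the ESPRESO code): one cublasSgemv call per dense matrix, cycled over a small pool of streams so that independent GEMVs can execute concurrently; the sizes and the fill step are placeholders, and the program must be linked against cuBLAS.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Multiplies 'count' independent dense m-by-m matrices by their own vectors,
// issuing one cublasSgemv per matrix and cycling over a pool of streams.
int main() {
    const int m = 512, count = 64, nStreams = 4;
    cublasHandle_t handle;
    cublasCreate(&handle);

    std::vector<cudaStream_t> s(nStreams);
    for (auto &st : s) cudaStreamCreate(&st);

    std::vector<float*> A(count), x(count), y(count);
    for (int i = 0; i < count; ++i) {
        cudaMalloc(&A[i], m * m * sizeof(float));
        cudaMalloc(&x[i], m * sizeof(float));
        cudaMalloc(&y[i], m * sizeof(float));
        // (Fill A[i] and x[i] with real data here.)
    }

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < count; ++i) {
        cublasSetStream(handle, s[i % nStreams]);        // route this GEMV to a stream
        cublasSgemv(handle, CUBLAS_OP_N, m, m, &alpha,
                    A[i], m, x[i], 1, &beta, y[i], 1);
    }
    cudaDeviceSynchronize();
    printf("issued %d GEMVs over %d streams\n", count, nStreams);

    for (int i = 0; i < count; ++i) { cudaFree(A[i]); cudaFree(x[i]); cudaFree(y[i]); }
    for (auto &st : s) cudaStreamDestroy(st);
    cublasDestroy(handle);
    return 0;
}
```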
- Research Article
- 10.1109/bigdata.2014.7004245
- Oct 1, 2014
- Proceedings of the IEEE International Conference on Big Data
A push-based database management system (DBMS) is a new type of data processing software that streams large volumes of data to concurrent query operators. The high data rate of such systems requires large computing power from the query engine. In our previous work, we built a push-based DBMS named G-SDMS to harness the unrivaled computational capabilities of modern GPUs. A major design goal of G-SDMS is to support concurrent processing of heterogeneous query operations and enable resource allocation among such operations. Understanding the performance of operations as a result of resource consumption is thus a prerequisite in the design of G-SDMS. With NVIDIA's CUDA framework as the system implementation platform, we present our recent work on performance modeling of CUDA kernels running concurrently under a runtime mechanism named CUDA streams. Specifically, we explore the connection between performance and resource occupancy of compute-bound kernels and develop a model that can predict the performance of such kernels. Furthermore, we provide an in-depth anatomy of the CUDA stream mechanism and summarize its main kernel scheduling disciplines. Our models and derived scheduling disciplines are verified by extensive experiments using synthetic and real-world CUDA kernels.
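A small experiment in the spirit of this analysis (illustrative only, not the paper's model or benchmark kernels): launch two deliberately under-occupying kernels first in the same stream and then in two streams, and compare the elapsed times to see whether they actually ran concurrently.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// A deliberately small, compute-bound kernel so two instances can share the GPU.
__global__ void spin(float *x, int iters) {
    float v = x[threadIdx.x];
    for (int i = 0; i < iters; ++i) v = v * 1.000001f + 0.5f;
    x[threadIdx.x] = v;
}

static float timed(cudaStream_t s0, cudaStream_t s1, float *a, float *b) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    spin<<<1, 256, 0, s0>>>(a, 1 << 20);   // one block each: most SMs stay idle
    spin<<<1, 256, 0, s1>>>(b, 1 << 20);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));
    cudaMemset(a, 0, 256 * sizeof(float));
    cudaMemset(b, 0, 256 * sizeof(float));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0); cudaStreamCreate(&s1);
    printf("same stream : %.2f ms\n", timed(s0, s0, a, b));   // serialized
    printf("two streams : %.2f ms\n", timed(s0, s1, a, b));   // should overlap
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(a); cudaFree(b);
    return 0;
}
```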
- Research Article
- 10.1109/access.2021.3122466
- Jan 1, 2021
- IEEE Access
Recently, Graphics Processing Units (GPUs) have been widely used for general-purpose applications such as machine learning and the acceleration of cryptographic applications (especially blockchains). The development of CUDA makes this general-purpose computing on GPUs possible. In particular, GPU technology is now widely used for server-side applications to provide fast and efficient service to large numbers of clients. In other words, servers need to process a large amount of user data and execute authentication processes. Verifying the integrity of transmitted data is essential for ensuring that the data is not modified during transmission. Hash functions are cryptographic algorithms that can verify the integrity of data; SHA-1, SHA-2, and SHA-3 are standardized hash functions. NIST selected the Keccak algorithm as the winner of the SHA-3 competition and standardized it as SHA-3 in 2015. However, software implementations of SHA-3 have so far not provided enough performance for various applications. In addition, SHA-3 and the SHA-3-based SHAKE functions are used in many Post-Quantum Cryptosystems (PQC) submitted to the NIST PQC competition. Therefore, SHA-3 optimization research is required in the software environment. We propose an optimized SHA-3 software implementation for the GPU environment. For performance efficiency, we propose several techniques including optimization of the SHA-3 internal process, inline PTX optimization, optimized memory usage, and the application of asynchronous CUDA streams. As a result of applying the proposed optimization methods, our SHA-3(512) (resp. SHA-3(256)) implementation without CUDA streams provides a maximum throughput of 88.51 Gb/s (resp. 171.62 Gb/s) on an RTX 2080 Ti GPU. Furthermore, without the application of CUDA streams, our SHA-3(512) software on a GTX 1070 provides about 49.73% higher throughput than the previous best work on a GTX 1080, which shows the superiority of our proposed optimization methods. Our optimized SHA-3 software on GPUs can be efficiently used for blockchain applications and several PQC schemes (especially the key generation process in lattice-based cryptosystems).
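The inline PTX aspect can be illustrated with a hedged sketch (not the paper's code): a 32-bit rotate implemented with the PTX funnel-shift instruction, a common trick in hash kernels; Keccak itself operates on 64-bit lanes, so this is only indicative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdint>

// 32-bit rotate-left via the PTX funnel-shift instruction (requires sm_32 or newer).
__device__ __forceinline__ uint32_t rotl32_ptx(uint32_t x, uint32_t n) {
    uint32_t r;
    asm("shf.l.wrap.b32 %0, %1, %1, %2;" : "=r"(r) : "r"(x), "r"(n));
    return r;
}

// Portable reference version for comparison.
__device__ __forceinline__ uint32_t rotl32_c(uint32_t x, uint32_t n) {
    return (x << n) | (x >> (32u - n));
}

__global__ void check(uint32_t *out) {
    uint32_t x = 0x12345678u, n = (threadIdx.x % 31u) + 1u;  // shift in [1, 31]
    out[threadIdx.x] = (rotl32_ptx(x, n) == rotl32_c(x, n)) ? 1u : 0u;
}

int main() {
    uint32_t *d, h[32];
    cudaMalloc(&d, sizeof(h));
    check<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    uint32_t ok = 0;
    for (uint32_t v : h) ok += v;
    printf("%u/32 lanes agree\n", ok);
    cudaFree(d);
    return 0;
}
```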
- Research Article
- 10.1016/j.sysarc.2023.102888
- Apr 26, 2023
- Journal of Systems Architecture
Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs
- Research Article
- 10.1002/cpe.7897
- Aug 29, 2023
- Concurrency and Computation: Practice and Experience
In this article, we propose a framework that allows programming a parallel application for a multi-node system, with one or more graphics processing units (GPUs) per node, using OpenMP plus an extended CUDA API. OpenMP is used for launching threads responsible for management of particular GPUs, and the extended CUDA calls allow transferring data and launching kernels on local and remote GPUs. The framework hides inter-node MPI communication from the programmer. For optimization, the implementation takes advantage of the MPI_THREAD_MULTIPLE mode, allowing multiple threads to handle distinct GPUs and to overlap communication and computation transparently using multiple CUDA streams. The solution parallelizes data across available GPUs in order to minimize execution time, and supports a power-aware mode in which GPUs are automatically selected for computation using a greedy approach so as not to exceed an imposed power limit. We implemented and benchmarked three parallel applications: finding the largest divisors; verification of the Collatz conjecture; and finding patterns in vectors. These were tested on three different systems: a GPU cluster with 16 nodes, each with an NVIDIA GTX 1060 GPU; a powerful 2-node system, one node with 8 NVIDIA Quadro RTX 6000 GPUs and the second with 4 NVIDIA Quadro RTX 5000 GPUs; and a heterogeneous environment with one node with 2 NVIDIA RTX 2080 GPUs and 2 nodes with NVIDIA GTX 1060 GPUs. We demonstrated the effectiveness of the framework through execution times versus power caps within ranges of 100–1400 W, 250–3000 W, and 125–600 W for these systems, respectively, as well as gains from using two versus one CUDA stream per GPU. Finally, we showed that for the test applications the solution obtains high speed-ups, between 89.3% and 97.4% of the theoretically assessed ideal ones, for 16 nodes and 2 CUDA streams, demonstrating very good parallel efficiency.
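A minimal single-node sketch of the thread-per-GPU pattern used by such frameworks (the MPI layer, power capping, and remote-GPU calls are omitted, and `fill` is a placeholder): each OpenMP thread selects its own device and drives it through its own pair of streams. Compile with OpenMP enabled, e.g. `nvcc -Xcompiler -fopenmp`.

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

// Illustrative kernel standing in for the framework's real workloads.
__global__ void fill(float *x, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

int main() {
    int nGpu = 0;
    cudaGetDeviceCount(&nGpu);
    if (nGpu < 1) { printf("no CUDA device\n"); return 0; }
    const int N = 1 << 20;

    // One OpenMP thread manages one GPU; each thread owns two streams so that
    // its own transfers and kernels can overlap.
    #pragma omp parallel num_threads(nGpu)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
        float *d;
        cudaMalloc(&d, N * sizeof(float));
        for (int k = 0; k < 2; ++k)
            fill<<<(N / 2 + 255) / 256, 256, 0, s[k]>>>(d + k * (N / 2), N / 2, (float)dev);
        cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);
        #pragma omp critical
        printf("GPU %d done\n", dev);
        cudaFree(d);
        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    }
    return 0;
}
```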
- Research Article
- 10.3390/app10113711
- May 27, 2020
- Applied Sciences
With the advent of IoT and Cloud computing service technology, the size of user data to be managed and file data to be transmitted has increased significantly. To protect users' personal information, it is necessary to encrypt it in a secure and efficient way. Since servers handling many clients or IoT devices have to encrypt a large amount of data in real time without compromising service capabilities, Graphics Processing Units (GPUs) have been considered a suitable crypto accelerator for processing huge amounts of data in this situation. In this paper, we present highly efficient implementations of block ciphers on NVIDIA GPUs (specifically, the Maxwell, Pascal, and Turing architectures) for environments using massively large data in IoT and Cloud computing applications. As block cipher algorithms, we choose AES, a representative standard block cipher; LEA, which was recently added to the ISO/IEC 29192-2:2019 standard; and CHAM, a recently developed lightweight block cipher. To maximize the parallelism of the encryption process, we utilize the Counter (CTR) mode of operation and customize it using the GPU's characteristics. We applied several optimization techniques with respect to the characteristics of the GPU architecture, such as kernel parallelism, memory optimization, and CUDA streams. Furthermore, we optimized each target cipher by considering its algorithmic characteristics, implementing the core part of each cipher with handcrafted inline PTX (Parallel Thread eXecution) code, the virtual assembly language of the CUDA platform. With the application of our optimization techniques, our implementation on an RTX 2070 GPU achieves up to 310 Gbps for AES and 2.47 Tbps for LEA, which are 10.7% and 67% improvements over the previous best results of 279.86 Gbps and 1.47 Tbps. In the case of CHAM, this is the first optimized GPU implementation, and it achieves 3.03 Tbps of throughput on an RTX 2070 GPU.
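A hedged, kernel-only sketch of why CTR mode parallelizes so well on a GPU (the host launch and key schedule are omitted, and `encrypt_block` is a trivial placeholder rather than AES, LEA, or CHAM): every thread derives its own counter block from the thread index, so all 16-byte blocks can be processed independently.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Placeholder round function; a real implementation would run the full AES/LEA/CHAM
// rounds on the 16-byte counter block held in registers.
__device__ void encrypt_block(uint8_t out[16], const uint8_t ctr[16], const uint8_t *key) {
    for (int i = 0; i < 16; ++i) out[i] = ctr[i] ^ key[i];
}

// CTR mode: thread t builds counter block (nonce || t), encrypts it, and XORs the
// keystream into its own 16-byte slice, so all blocks are independent and parallel.
__global__ void ctr_encrypt(const uint8_t *in, uint8_t *out, const uint8_t *key,
                            const uint8_t *nonce, size_t nBlocks) {
    size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nBlocks) return;
    uint8_t ctr[16], ks[16];
    for (int i = 0; i < 8; ++i) ctr[i] = nonce[i];                       // 64-bit nonce
    uint64_t c = t;
    for (int i = 0; i < 8; ++i) ctr[8 + i] = (uint8_t)(c >> (8 * i));    // 64-bit counter
    encrypt_block(ks, ctr, key);
    for (int i = 0; i < 16; ++i) out[16 * t + i] = in[16 * t + i] ^ ks[i];
}
```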
- Conference Article
- 10.1145/2342896.2342977
- Jan 1, 2012
Motion estimation in the H.264/MPEG-4 AVC standard takes about 91% of encoding time. Fortunately, the problem of block-based motion estimation is highly parallel: motion vectors are calculated by determining block displacement within an area, typically 32 x 32 pixels, in a known reference frame. We enhance the GPU-based Sum of Absolute Differences (SAD) calculations of motion estimation using CUDA streams to hide memory latency by means of different overlapping techniques. A novel implementation strategy is explored that takes advantage of the amount of shared memory available in GPU devices of compute capability 2.x.
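A hedged, kernel-only sketch of a shared-memory SAD search (not the paper's implementation, and the stream-based overlapping is omitted): one thread block handles one 16x16 macroblock, stages it in shared memory, and lets each thread evaluate one candidate displacement. It assumes the reference frame is padded so the search window stays in bounds and that `bestSad` is preset to INT_MAX on the host.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Launch with blockDim = (17, 17): one thread per candidate displacement in [-8, +8]^2.
// bestSad must be initialized to INT_MAX by the host before the launch.
__global__ void sad_search(const uint8_t *cur, const uint8_t *ref,
                           int width, int mbx, int mby, int *bestSad, int *bestMv) {
    __shared__ uint8_t blk[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;

    // Stage the current (source) macroblock; the first 16x16 threads do the loads.
    if (tx < 16 && ty < 16)
        blk[ty][tx] = cur[(mby * 16 + ty) * width + mbx * 16 + tx];
    __syncthreads();

    // Candidate displacement for this thread; assumes padded reference frame.
    int dx = tx - 8, dy = ty - 8;
    int ox = mbx * 16 + dx, oy = mby * 16 + dy;
    int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sad += abs((int)blk[y][x] - (int)ref[(oy + y) * width + ox + x]);

    // Keep the best candidate (simplified: atomic minimum, then the matching thread
    // records its vector; ties are resolved arbitrarily).
    atomicMin(bestSad, sad);
    __syncthreads();
    if (sad == *bestSad) *bestMv = (dy << 16) | (dx & 0xffff);
}
```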
- Conference Article
- 10.1109/cem.2013.6617123
- Aug 1, 2013
For the solution of large-scale inverse scattering problems, in either the acoustic or the electromagnetic domain, gradient-based optimization approaches are a method of choice, especially when the derivatives with respect to the parameter of interest can be obtained from adjoint fields [1], [2]. Gradients with respect to a parameter can be computed effectively using an adjoint approach in which the direct and adjoint fields are integrated in opposite temporal directions. Since this yields high memory consumption, memory-reduced computation of the gradients using checkpointing and recomputation of states from the checkpoints is a method of choice. We propose the use of graphics processing units (GPUs) to accelerate the computation by solving the direct problem on the GPU and the adjoint problem on the CPU. The implementation of pipelining based on CUDA streams and pinned memory masks the memory transfers between host and GPU and allows the adjoint derivatives to be computed in only a little more than twice the time of the solution of the direct problem.
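A minimal sketch of overlapping checkpoint transfers with ongoing work using pinned memory and events (illustrative only; the real scheme pipelines the CPU-side adjoint integration against the GPU-side direct solve): while the GPU advances step t, the host consumes the checkpoint of step t-1, whose asynchronous copy has already completed.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for one time step of the direct (forward) solve on the GPU.
__global__ void forward_step(float *field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += 1.0f;
}

// Placeholder for CPU-side adjoint work on a checkpoint that is already on the host.
static void adjoint_work(const float *chk, int n) { (void)chk; (void)n; }

int main() {
    const int N = 1 << 20, steps = 8;
    const size_t bytes = N * sizeof(float);
    float *d_field, *h_chk[2];
    cudaMalloc(&d_field, bytes);
    cudaMemset(d_field, 0, bytes);
    cudaMallocHost(&h_chk[0], bytes);   // pinned buffers: required for asynchronous copies
    cudaMallocHost(&h_chk[1], bytes);

    cudaStream_t gpu;                   // all GPU work enqueued in order in one stream
    cudaStreamCreate(&gpu);
    cudaEvent_t copied[2];
    cudaEventCreate(&copied[0]); cudaEventCreate(&copied[1]);

    for (int t = 0; t < steps; ++t) {
        forward_step<<<(N + 255) / 256, 256, 0, gpu>>>(d_field, N);
        cudaMemcpyAsync(h_chk[t & 1], d_field, bytes, cudaMemcpyDeviceToHost, gpu);
        cudaEventRecord(copied[t & 1], gpu);
        if (t > 0) {                                    // CPU consumes the previous checkpoint
            cudaEventSynchronize(copied[(t - 1) & 1]);  // while the GPU runs step t
            adjoint_work(h_chk[(t - 1) & 1], N);
        }
    }
    cudaEventSynchronize(copied[(steps - 1) & 1]);
    adjoint_work(h_chk[(steps - 1) & 1], N);
    printf("processed %d checkpoints\n", steps);

    cudaFreeHost(h_chk[0]); cudaFreeHost(h_chk[1]);
    cudaFree(d_field); cudaStreamDestroy(gpu);
    cudaEventDestroy(copied[0]); cudaEventDestroy(copied[1]);
    return 0;
}
```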
- Conference Article
- 10.1145/2764967.2764968
- Jun 1, 2015
Stream programming based on the synchronous data flow (SDF) model naturally exposes data, task and pipeline parallelism. Statically scheduling stream programs for homogeneous architectures has been an area of extensive research. With graphics processing units (GPUs) now emerging as general-purpose co-processors, scheduling and distribution of these stream programs onto heterogeneous architectures (having both GPUs and CPUs) provides challenging research. Exploiting this abundant parallelism in hardware and providing a scalable solution is a hard problem. In this paper we describe a coarse-grained software-pipelined scheduling algorithm that statically schedules a stream graph onto heterogeneous architectures. We formulate the problem of partitioning the work between the CPU cores and the GPU as a model-checking problem. The partitioning process takes into account the costs of the buffer layout transformations required by the partitioning and distribution of the stream graph. The solution trace resulting from the model checking provides a map for the distribution of actors across different processors/cores. This solution is then divided into stages, and coarse-grained software-pipelined code is generated. We use CUDA streams to map these programs synergistically onto the CPUs and GPUs, and we use a performance model for data transfers to determine the optimal number of CUDA streams on the GPU. Our software-pipelined schedule yields a speedup of up to 55.86X and a geometric mean speedup of 9.62X over a single-threaded CPU.
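The idea of choosing the number of streams from a transfer model can be illustrated with a first-order pipeline estimate (a hedged sketch, not the paper's model): splitting a job into s equal chunks costs roughly one full chunk latency plus s-1 repetitions of the slowest stage, so sweeping s gives a cheap way to pick a candidate stream count before measuring.

```cuda
#include <algorithm>
#include <cstdio>

// First-order estimate: a job with copy-in, kernel, and copy-out phases taking
// ci, k, co milliseconds, split into s equal chunks processed in a pipeline.
static double pipelined_ms(double ci, double k, double co, int s) {
    double chunk = (ci + k + co) / s;                 // latency of the first chunk
    double bottleneck = std::max({ci, k, co}) / s;    // steady-state cost per extra chunk
    return chunk + (s - 1) * bottleneck;
}

int main() {
    const double ci = 8.0, k = 12.0, co = 8.0;        // hypothetical phase times in ms
    for (int s = 1; s <= 8; ++s)
        printf("streams=%d  estimate=%.2f ms\n", s, pipelined_ms(ci, k, co, s));
    return 0;
}
```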
- Research Article
- 10.1145/3773039
- Dec 17, 2025
- ACM Transactions on Design Automation of Electronic Systems
The acceleration of the inference process for deep learning models is closely tied to the parallelization of computational graph operators and the parallel scheduling strategy. Most existing deep learning compilers focus on optimizing intra-operator parallelism while neglecting inter-operator parallelism. Furthermore, most industrial inference engines, such as PyTorch and TensorFlow, use a dataflow-based model to describe tasks and schedule operators; they are computationally expensive, execute operators in topological order, and parallelize work only within a single CUDA stream, failing to fully exploit the parallelism available with multiple CUDA streams. In this article, we propose PPD, a portable, highly parallel dispatching system. It boosts inference performance by dividing the computational graph into multiple taskflow-based subgraphs. Additionally, PPD provides a dispatching algorithm for a single GPU with multiple CUDA streams to enhance the parallelism and performance of model inference. PPD offers users a lightweight model definition and an inference C++ interface, allowing for seamless integration into any context. We also verify the feasibility of PPD on AMD and other graphics cards. We validate our approach on widely adopted neural network models with varying degrees of parallelism and compare it with industrial inference engines. Experiments demonstrate that PPD outperforms state-of-the-art methods by up to 2.28×.
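A hedged sketch of inter-operator parallelism with multiple streams (not PPD itself): a producer operator runs in one stream, an event marks its completion, and two independent consumer operators then run concurrently in separate streams via cudaStreamWaitEvent.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Three placeholder "operators": A feeds both B and C, which are independent of each other.
__global__ void opA(float *x, int n)                   { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) x[i] = (float)i; }
__global__ void opB(const float *x, float *y, int n)   { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) y[i] = x[i] * 2.f; }
__global__ void opC(const float *x, float *z, int n)   { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) z[i] = x[i] + 1.f; }

int main() {
    const int N = 1 << 20, B = 256, G = (N + B - 1) / B;
    float *x, *y, *z;
    cudaMalloc(&x, N * sizeof(float));
    cudaMalloc(&y, N * sizeof(float));
    cudaMalloc(&z, N * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0); cudaStreamCreate(&s1);
    cudaEvent_t aDone;
    cudaEventCreate(&aDone);

    opA<<<G, B, 0, s0>>>(x, N);          // producer runs in stream 0
    cudaEventRecord(aDone, s0);
    cudaStreamWaitEvent(s1, aDone, 0);   // stream 1 must not start until A's output exists
    opB<<<G, B, 0, s0>>>(x, y, N);       // the independent consumers can now run
    opC<<<G, B, 0, s1>>>(x, z, N);       // concurrently, one per stream
    cudaDeviceSynchronize();

    printf("graph dispatched: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x); cudaFree(y); cudaFree(z);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1); cudaEventDestroy(aDone);
    return 0;
}
```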
- Research Article
- 10.12783/dtmse/ameme2020/35540
- Apr 7, 2021
- DEStech Transactions on Materials Science and Engineering
In order to improve the reconfigurability and computing efficiency of polyphase channelization systems, a new algorithm based on the CUDA stream architecture was designed and optimized. First, the principle of the blind-zone-free parallel channelization algorithm is introduced. Then, the various resource constraints of the CUDA architecture and the relationship between operating efficiency and CUDA kernel parameters under these constraints are analysed, and the implementation structure of the polyphase channelization algorithm is designed. Finally, an NVIDIA GPU is used to implement and test the CUDA-stream-based polyphase channelization algorithm. The results show that the structure designed in this paper meets real-time requirements and achieves about a 10% efficiency improvement over the traditional algorithm.
- Conference Article
- 10.1145/3366428.3380773
- Feb 23, 2020
Multiresolution filters, which analyze information at different scales, are crucial for many applications in digital image processing. The different space and time complexity at distinct scales of the pyramidal structure poses both a challenge and an opportunity for implementations on modern accelerators such as GPUs with an increasing number of compute units. In this paper, we exploit the potential of concurrent kernel execution in multiresolution filters. As a major contribution, we present a model-based approach for performance analysis of both single- and multi-stream implementations, combining application- and architecture-specific knowledge. As a second contribution, the involved transformations and code generators using CUDA streams on NVIDIA GPUs have been integrated into a compiler-based approach using an image processing DSL called Hipacc. We then apply our approach to evaluate and compare the achieved performance for four real-world applications on three GPUs. The results show that our method can achieve a geometric mean speedup of up to 2.5 over the original Hipacc implementation without our approach, up to 2.0 over the other state-of-the-art DSL Halide, and up to 1.3 over NVIDIA's recently released CUDA Graph programming model.
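For comparison with the CUDA Graph baseline mentioned above, here is a minimal stream-capture sketch (illustrative placeholders, not Hipacc-generated code): the per-level kernel launches are recorded once into a graph and then replayed with a single launch per iteration, which removes much of the per-kernel launch overhead that multiresolution pyramids otherwise accumulate.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Two placeholder pyramid-level kernels; a real multiresolution filter would launch
// one kernel per level with shrinking grids.
__global__ void level0(float *x, int n) { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) x[i] += 1.f; }
__global__ void level1(float *x, int n) { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) x[i] *= 0.5f; }

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemset(d, 0, N * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record the launch sequence once into a graph ...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    level0<<<(N + 255) / 256, 256, 0, s>>>(d, N);
    level1<<<(N / 4 + 255) / 256, 256, 0, s>>>(d, N / 4);
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 10/11 signature; CUDA 12 drops the last three arguments

    // ... then replay it many times with a single launch call per iteration.
    for (int it = 0; it < 100; ++it)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);
    printf("replayed captured graph 100 times\n");

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```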