CUDA Streams Research Articles

As engineering technology develops, simulations must meet higher requirements for accuracy and scale, and the computational efficiency of traditional serial programs can no longer keep up. Reducing the run time of the temperature control simulation program is therefore important for real-time simulation of the temperature and stress fields, which in turn supports more reasonable temperature control and crack prevention measures. To address this, GPU parallel computing is introduced into the temperature control simulation program for mass concrete and then optimized. An improved indicator formula for analyzing GPU parallel algorithms is proposed that accounts for GPU clock rate, number of cores, parallel overhead, and the size of the parallel region, making up for the shortcomings of traditional formulas that consider only time. According to this formula, when there are enough threads, the parallel effect is limited by the size of the parallel region, and when the parallel region is large enough, efficiency is limited by the parallel overhead and the clock rate. The paper also studies the optimal kernel execution configuration. Shared memory is used to improve memory access efficiency by 155%. After bank conflicts were resolved, a speedup of 437.5× was achieved in the matrix transpose subroutine of the solver. CUDA streams are used to run data access and computation asynchronously on the GPU, which overlaps part of the data access time. On top of GPU parallelism, asynchronous execution can double the computing efficiency. Compared with the serial program, the GPU asynchronous parallel program achieves a speedup of 61.42× for inner-product matrix multiplication. The study further proposes a theoretical formula for the data access overlap rate to guide the choice of the number of CUDA streams that gives the best computing conditions. The GPU parallel program, compiled and optimized on the CUDA Fortran platform, effectively improves the computational efficiency of the concrete temperature control simulation program and better serves engineering computing.
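The abstract above does not state its overlap-rate formula, so the following is only a minimal sketch of the general idea using a textbook two-stage pipeline model, not the paper's formula. It assumes one copy engine, equal-sized chunks (one per stream), and a fixed per-chunk launch overhead; the function names and the `overhead` parameter are hypothetical.

```python
def pipelined_time(copy_time, kernel_time, n_streams, overhead=0.0):
    """Makespan of a copy -> compute pipeline split into n_streams equal chunks.

    Assumed model (not from the paper): each chunk is copied, then computed;
    the copy of chunk i+1 may overlap the kernel of chunk i.
    """
    if n_streams < 1:
        raise ValueError("n_streams must be at least 1")
    copy_chunk = copy_time / n_streams + overhead   # per-chunk transfer cost
    kernel_chunk = kernel_time / n_streams          # per-chunk compute cost
    # First chunk pays both stages; every later chunk pays only the slower stage.
    return copy_chunk + kernel_chunk + (n_streams - 1) * max(copy_chunk, kernel_chunk)


def overlap_rate(copy_time, kernel_time, n_streams, overhead=0.0):
    """Fraction of the serial copy time that is hidden behind computation."""
    serial = copy_time + kernel_time
    saved = serial - pipelined_time(copy_time, kernel_time, n_streams, overhead)
    return saved / copy_time


def best_stream_count(copy_time, kernel_time, overhead, max_streams=64):
    """Stream count in 1..max_streams with the smallest modeled makespan."""
    return min(
        range(1, max_streams + 1),
        key=lambda n: pipelined_time(copy_time, kernel_time, n, overhead),
    )


if __name__ == "__main__":
    # Equal copy and kernel times (arbitrary units), no overhead.
    for n in (1, 2, 4, 8):
        t = pipelined_time(8.0, 8.0, n)
        print(n, t, overlap_rate(8.0, 8.0, n))
    # With a per-chunk overhead, more streams stop helping at some point.
    print(best_stream_count(8.0, 8.0, overhead=0.5))
```

Under this model, when copy and kernel times are equal and overhead is zero, the speedup over the non-overlapped version is 2n/(n+1), which approaches 2 as the number of streams grows. That is consistent with the abstract's remark that asynchronous execution can double efficiency, while a nonzero overhead yields a finite best stream count.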

Recently, Graphics Processing Units (GPUs) have been widely used for general-purpose applications such as machine learning and the acceleration of cryptographic workloads (especially blockchains). The development of CUDA made this general-purpose computing on GPUs possible. In particular, GPU technology is now widely used in server-side applications to provide fast and efficient service to many clients; such servers must process large amounts of user data and carry out authentication. Verifying the integrity of transmitted data is essential to ensure that it was not modified in transit. Hash functions are the cryptographic algorithms that verify data integrity, and the standard hash functions are SHA-1, SHA-2, and SHA-3. NIST selected the Keccak algorithm as the winner of the SHA-3 competition in 2012 and standardized it as SHA-3 in 2015. However, software implementations of SHA-3 have so far not provided enough performance for many applications. In addition, SHA-3 and the SHA-3-based SHAKE functions are used in many Post-Quantum Cryptosystems (PQC) submitted to the NIST PQC competition. SHA-3 optimization research is therefore needed in the software environment. We propose an optimized SHA-3 software implementation for the GPU environment. For performance, we propose several techniques, including optimization of the SHA-3 internal process, inline PTX optimization, optimized memory usage, and the application of asynchronous CUDA streams. With the proposed optimizations, our SHA-3(512) (resp. SHA-3(256)) implementation without CUDA streams provides a maximum throughput of 88.51 Gb/s (resp. 171.62 Gb/s) on an RTX2080Ti GPU. Furthermore, without CUDA streams, our SHA-3(512) software on a GTX1070 provides about 49.73% higher throughput than the previous best work on a GTX1080, which shows the benefit of the proposed optimization methods. Our optimized SHA-3 software on GPU can be used efficiently for blockchain applications and several PQCs (especially the key generation process in lattice-based cryptosystems).
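As a small CPU-side illustration of the integrity check and the SHAKE usage mentioned in the abstract above, the sketch below uses Python's standard `hashlib` module. It is not the GPU implementation the abstract describes, and the helper names (`sha3_digest`, `verify_integrity`, `shake_output`) are hypothetical.

```python
import hashlib
import hmac


def sha3_digest(data: bytes, bits: int = 256) -> str:
    """Hex digest of data under SHA-3(256) or SHA-3(512)."""
    algorithms = {256: hashlib.sha3_256, 512: hashlib.sha3_512}
    if bits not in algorithms:
        raise ValueError("bits must be 256 or 512")
    return algorithms[bits](data).hexdigest()


def verify_integrity(data: bytes, expected_hex: str, bits: int = 256) -> bool:
    """True if the received data still hashes to the digest sent alongside it."""
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(sha3_digest(data, bits), expected_hex)


def shake_output(seed: bytes, n_bytes: int) -> str:
    """Variable-length SHAKE128 output, as used to expand seeds in some PQC schemes."""
    return hashlib.shake_128(seed).hexdigest(n_bytes)


if __name__ == "__main__":
    message = b"transfer 10 coins to Alice"
    tag = sha3_digest(message, bits=512)
    print(verify_integrity(message, tag, bits=512))                         # unmodified
    print(verify_integrity(b"transfer 99 coins to Alice", tag, bits=512))   # tampered
    print(shake_output(b"seed", 32))
```

A server that verifies many such messages repeats this hash once per message, which is the kind of bulk workload the abstract targets with GPU batching and streams.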

Articles published on CUDA Streams

Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments

CuXCMP: CUDA-Accelerated Private Comparison Based on Homomorphic Encryption

Research on the Application and Performance Optimization of GPU Parallel Computing in Concrete Temperature Control Simulation

A fast, dense Chebyshev solver for electronic structure on GPUs

Distributed out-of-memory NMF on CPU/GPU architectures

A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems

Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Efficient parallel implementation of crowd simulation using a hybrid CPU+GPU high performance computing system

Performance Enhancement of CUDA Applications by Overlapping Data Transfer and Kernel Execution

Fast period searches using the Lomb–Scargle algorithm on Graphics Processing Units for large datasets and real-time applications

Design of Polyphase Channelization Algorithm Based on CUDA Stream Architecture

Implementation of VLBI Digital Baseband Converter with CUDA

Fast Implementation of SHA-3 in GPU Environment

High-speed Parallel Feature Extraction Algorithm of Wind Tunnel Image Based on GPU

MTFC: A Multi-GPU Training Framework for Cube-CNN-Based Hyperspectral Image Classification

Highly Efficient Implementation of Block Ciphers on Graphic Processing Units for Massively Large Data

GPU Parallelization of a Hybrid Pseudospectral Geophysical Turbulence Framework Using CUDA

Investigation of Parallel Data Processing Using Hybrid High Performance CPU

Exploiting potential of deep neural networks by layer-wise fine-grained parallelism

Concurrent query processing in a GPU-based database system
