Graphics Processing Units Kernels Research Articles

Many emerging cyber-physical systems, such as autonomous vehicles and robots, rely heavily on artificial intelligence and machine learning algorithms to perform important system operations. Since these highly parallel applications are computationally intensive, they need to be accelerated by graphics processing units (GPUs) to meet stringent timing constraints. However, despite the wide adoption of GPUs, efficiently scheduling multiple GPU applications while providing rigorous real-time guarantees remains a challenge. Each GPU application has multiple CPU execution and memory copy segments, with GPU kernels running on different hardware resources. Because of the complicated interactions between heterogeneous segments of parallel tasks, high schedulability is hard to achieve with conventional approaches. This paper proposes RTGPU, which combines fine-grained GPU partitioning on the system side with a novel scheduling algorithm on the theory side. Through system and theory co-design, RTGPU achieves superior system throughput and real-time schedulability. We start by building a model for the CPU and memory copy segments. Leveraging persistent threads, we then implement fine-grained GPU partitioning with improved performance through interleaved execution. To reap the benefits of fine-grained GPU partitioning and schedule multiple parallel GPU applications, we propose a novel real-time scheduling algorithm based on federated scheduling and grid search with uniprocessor fixed-priority scheduling. Our approach provides real-time guarantees to meet hard deadlines, and achieves over 11% improvement in system throughput and up to 57% schedulability improvement compared with previous work. We validate and evaluate RTGPU on NVIDIA GTX1080Ti GPU systems. Our system-side techniques can be applied on mainstream NVIDIA GPUs, and the proposed scheduling theory can be used in general heterogeneous computing platforms that have a similar task execution pattern.
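
The abstract above names uniprocessor fixed-priority scheduling as one building block. As a rough illustration of what a schedulability test in that setting can look like, here is a minimal sketch of the classic response-time analysis for independent periodic tasks on one processor. It is not RTGPU's algorithm: the task model (integer execution time `C`, period `T`, deadline equal to the period) and the names `response_time` and `is_schedulable` are assumptions made for this example, and the federated scheduling and grid search steps are not modeled.

```python
def response_time(tasks, i, max_iter=1000):
    """Worst-case response time of tasks[i] under fixed-priority scheduling.

    tasks: list of (C, T) integer pairs sorted from highest to lowest priority,
           where C is the execution time and T the period (assumed deadline).
    Returns the response time, or None if it exceeds the period.
    """
    c_i, t_i = tasks[i]
    r = c_i
    for _ in range(max_iter):
        # Interference: each higher-priority task j releases ceil(r / T_j) jobs
        # within a window of length r; -(-r // t_j) is an integer ceiling.
        interference = sum(-(-r // t_j) * c_j for c_j, t_j in tasks[:i])
        r_next = c_i + interference
        if r_next > t_i:
            return None      # deadline miss
        if r_next == r:
            return r         # fixed point reached
        r = r_next
    return None              # no convergence within max_iter (treated as a miss)


def is_schedulable(tasks):
    """True if every task's response time stays within its period."""
    return all(response_time(tasks, i) is not None for i in range(len(tasks)))


if __name__ == "__main__":
    demo = [(1, 4), (2, 6), (3, 12)]   # hypothetical task set
    print([response_time(demo, i) for i in range(len(demo))])
    print(is_schedulable(demo))
```

The iteration starts from the task's own execution time and grows the window until the interference from higher-priority tasks stops changing; a task set passes only if every task converges within its period.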

In 5G New Radio (NR), low-density parity-check (LDPC) codes are included as the error correction codes (ECC) for the data channel. While LDPC codes enable a low, near-Shannon-capacity bit error rate (BER), they also become a computational bottleneck in physical layer processing. Moreover, 5G LDPC poses new challenges not seen in previous LDPC implementations, such as Wi-Fi. The LDPC specification in 5G includes many reconfigurations to support a variety of rates, block sizes, and use cases. 5G also sets targets for supporting high-throughput and low-latency applications. For this new, flexible standard, traditional hardware-based solutions in FPGA and ASIC may struggle to support all cases and may be cost-prohibitive at scale. Software solutions can trivially support all possible reconfigurations but struggle with performance. This article demonstrates the high-throughput and low-latency capabilities of graphics processing units (GPUs) for LDPC decoding as an alternative to FPGA and ASIC decoders, effectively providing the high performance needed while maintaining the benefits of a software-based solution. In particular, we highlight how, by varying the parallelization strategy for mapping GPU kernels to blocks, we can use the many GPU cores to compute one codeword quickly to target low latency, or we can use the cores to work on many codewords simultaneously to target high-throughput applications. This flexibility is particularly useful for virtualized radio access networks (vRAN), a next-generation technology that is expected to become more prominent in the coming years. In vRAN, the hardware computational resources will become decoupled from the specific computational functions in the RAN through virtualization, allowing for benefits such as load balancing, improved scalability, and reduced costs. To highlight and investigate how the GPU can accelerate tasks such as LDPC decoding when containerizing vRAN functionality, we integrate our decoder into the Open Air Interface (OAI) NR software stack. With our GPU-based decoder, we measure a best-case latency of 87 μs and a best-case throughput of nearly 4 Gbps using the Titan RTX GPU.

Articles published on Graphics Processing Units Kernels

Time Predictable Modeling Method for GPU Architecture with SIMT and Cache Miss Awareness

Accelerating Static Timing Analysis Using CPU–GPU Heterogeneous Parallelism

A Multi-GPU Aggregation-Based AMG Preconditioner for Iterative Linear Solvers

RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks With Fine-Grain Utilization

G-RMOS: GPU-accelerated Riemannian Metric Optimization on Surfaces

Investigating the effect of varying block size on power and energy consumption of GPU kernels

Fast GPU-Based Generation of Large Graph Networks From Degree Distributions

Toward exascale whole-device modeling of fusion devices: Porting the GENE gyrokinetic microturbulence code to GPU

Inference of dynamic spatial GRN models with multi-GPU evolutionary computation

Idempotence-Based Preemptive GPU Kernel Scheduling for Embedded Systems

GPU-Based, LDPC Decoding for 5G and Beyond

GPU-Accelerated Adaptive PCBSO Mode-Based Hybrid RLA for Sparse LU Factorization in Circuit Simulation

GPGPU Performance Estimation With Core and Memory Frequency Scaling

CRState: checkpoint/restart of OpenCL program for in-kernel applications

Optimizing non-coalesced memory access for irregular applications with GPU computing

Efficient Parallelization of a Genetic Algorithm Solution on the Traveling Salesman Problem with Multi-core and Many-core Systems

A framework for scheduling dependent programs on GPU architectures

Genetic algorithm based estimation of non–functional properties for GPGPU programs

GPU empowered pipelines for calculating genome-wide kinship matrices with ultra-high dimensional genetic variants and facilitating 1D and 2D GWAS

A static analytical performance model for GPU kernel
