Graphics Processing Unit Systems Research Articles

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can place vastly different compute and memory demands on the GPU. In a large-scale computing environment, multiple applications can share a single GPU to accommodate such wide-ranging demands without leaving GPU resources underutilized, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which limitations in multi-application concurrency support hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to substantially reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
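
The contention effect described above can be illustrated with a toy model. The sketch below is not MASK's design: it assumes a hypothetical fully associative 8-entry TLB with LRU replacement, two applications that each cycle over 6 pages, and a simplified "token" rule under which only token holders may insert new entries. The names `LruTlb`, `run`, and `token_holders` are invented for this illustration.

```python
from collections import OrderedDict


class LruTlb:
    """Toy fully associative TLB with LRU replacement (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key: (app, page), ordered oldest first
        self.misses = 0
        self.accesses = 0

    def access(self, app, page, may_fill=True):
        key = (app, page)
        self.accesses += 1
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        self.misses += 1
        if may_fill:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[key] = True
        return False


def run(apps, pages_per_app, rounds, capacity, token_holders=None):
    """Interleave page accesses from all apps; return (misses, accesses).

    token_holders is an assumption of this sketch: if given, only those
    apps may insert entries into the TLB; all other apps skip the fill.
    """
    tlb = LruTlb(capacity)
    for _ in range(rounds):
        for page in range(pages_per_app):
            for app in apps:
                may_fill = token_holders is None or app in token_holders
                tlb.access(app, page, may_fill)
    return tlb.misses, tlb.accesses


alone = run(["A"], 6, 10, 8)                              # one app fits
shared = run(["A", "B"], 6, 10, 8)                        # 12 pages thrash 8 entries
tokens = run(["A", "B"], 6, 10, 8, token_holders={"A"})   # only A may fill

print("alone :", alone)
print("shared:", shared)
print("tokens:", tokens)
```

In this toy setting, one application running alone misses only on its first pass, while two interleaved applications evict each other's entries before they can be reused. Restricting fills to one application lets that application keep its working set resident. A real GPU would still need somewhere to serve the requests that skip the fill, which this sketch does not model.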

Computer simulation of large-scale spiking neural networks is an important method for understanding how biological neural networks process information. However, because of the high computational cost, such simulations have required high-performance computing on a supercomputer or a computer cluster. Recently, hardware for parallel computing, such as multi-core CPUs and graphics cards with a graphics processing unit (GPU), has become standard in gaming computers and workstations. Parallel computing on this hardware is therefore becoming widespread and offers substantial computing power for simulating large-scale spiking neural networks. However, it is not clear how much performance a parallel computing method using a new GPU gains in such simulations. In this study, we compared computation time between CPU-based and GPU-based computing methods in simulations of neuronal models. We developed neuronal simulation programs for computing systems that consist of a gaming graphics card with a new architecture (the NVIDIA GTX 1080) and an accelerator board using a GPU (the NVIDIA Tesla K20C). Our results show that these computing systems can simulate a large number of neurons faster than CPU-based systems. Furthermore, we investigated the accuracy of simulations using single-precision floating point. We show that single-precision results were sufficiently accurate compared with double-precision results, but that a chaotic neuronal response calculated by a GPU using single precision differs prominently from that calculated by a CPU using double precision. Moreover, the difference in chaotic dynamics appeared even when we used double precision on the GPU.
In conclusion, the GPU-based computing system exhibits higher computing performance than the CPU-based system, even when the GPU measurement includes data transfer from the graphics card to host memory.
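
The sensitivity of chaotic responses to floating-point precision can be reproduced in a few lines. The sketch below is only an analogy: it uses the logistic map as a stand-in chaotic system, not the neuronal models from the study, and it emulates single precision on the CPU by rounding every intermediate result to binary32 with the standard `struct` module. The helper names `f32`, `logistic_f64`, and `logistic_f32` are invented for this illustration.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def logistic_f64(x0, steps, r=4.0):
    """Iterate x -> r*x*(1-x) in double precision."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append((r * x) * (1.0 - x))
    return xs


def logistic_f32(x0, steps, r=4.0):
    """Same map, with every intermediate result rounded to single precision."""
    x = f32(x0)
    xs = [x]
    for _ in range(steps):
        x = f32(f32(r * x) * f32(1.0 - x))
        xs.append(x)
    return xs


double_run = logistic_f64(0.3, 100)
single_run = logistic_f32(0.3, 100)
gaps = [abs(a - b) for a, b in zip(double_run, single_run)]

early_gap = max(gaps[:6])   # first few iterations: still close
late_gap = max(gaps[50:])   # later iterations: trajectories have separated

print("largest gap in steps 0-5   :", early_gap)
print("largest gap in steps 50-100:", late_gap)
```

The two trajectories start within rounding error of each other, but a chaotic map amplifies that error at every step, so the trajectories eventually bear no resemblance to each other. Differences in operation ordering or fused multiply-add behaviour could plausibly produce the same effect between two double-precision implementations, though the sketch does not test that.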

Articles published on Graphics Processing Unit Systems

Mathematical Analysis for GPU Framework for High Performance Computing with Improved Scalability, Reliability and Seamless Configurability

Sustainable Optimizing Performance and Energy Efficiency in Proof of Work Blockchain: A Multilinear Regression Approach

Advanced hybrid MRAM based novel GPU cache system for graphic processing with high efficiency

A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems

RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks With Fine-Grain Utilization

Generic parallel data structures and algorithms to GPU superpixel image segmentation

Fast infrared radiative transfer calculations using graphics processing units: JURASSIC-GPU v2.0

NURA

A Spectroscopic Diffuse Optical Tomography System for the Continuous 3-D Functional Imaging of Tissue—A Phantom Study

Object Detection with Low Capacity GPU Systems Using Improved Faster R-CNN

MASK

Evaluation of the computational efficacy in GPU-accelerated simulations of spiking neurons

Parallelization and Performance of the NIM Weather Model on CPU, GPU, and MIC Processors

Real-time computation of parameter fitting and image reconstruction using graphical processing units

An accelerated framework for the classification of biological targets from solid-state micropore data

High-Efficiency Computing Technology for Thermohydrodynamic Lubrication Analysis

GPURFSCREEN: a GPU based virtual screening tool using random forest classifier

Adaptive Scheduling Framework for Real-Time Video Encoding on Heterogeneous Systems

Performance of a projection method for incompressible flows on heterogeneous hardware

Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems
