Thread Block Research Articles

Intra-GPU synchronization is a problem for GPU-controlled communication. Options based on dynamic parallelism provide on-device synchronization. GPU-controlled communication has lower performance than CPU-assisted approaches. Relieving the CPU of the communication work leads to lower energy consumption. Serialization leads to a high number of instruction replays on GPUs.

Graphics Processing Units (GPUs) are widely used in high performance computing, due to their high computational power and high performance per Watt. However, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This affects not only performance but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU accelerates the computation while the CPU is responsible for the communication. This approach always requires a dedicated CPU thread, which consumes additional CPU cycles and therefore increases the power consumption of the complete application. In recent work we have shown that the GPU is able to control the communication independently of the CPU. However, there are several problems with GPU-controlled communication. The main problem is intra-GPU synchronization: GPU thread blocks are non-preemptive, so the use of communication requests within a GPU can easily result in a deadlock. In this work we show how dynamic parallelism solves this problem. GPU-controlled communication in combination with dynamic parallelism allows keeping the control flow of multi-GPU applications on the GPU and bypassing the CPU completely. Using other in-kernel synchronization methods results in massive performance losses, due to the forced serialization of the GPU thread blocks. Although the performance of applications using GPU-controlled communication is still slightly worse than the performance of hybrid applications, we will show that performance per Watt increases by up to 10% while still using commodity hardware.
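The claim that performance per Watt can rise while raw performance falls follows from simple arithmetic: performance per Watt is work divided by time and by average power, which is work per unit of energy. The sketch below illustrates this with purely hypothetical numbers (the run times and power draws are assumptions for illustration, not measurements from the article), and the function names are made up for this example.

```python
def perf_per_watt(work_units, seconds, avg_watts):
    """Performance per Watt = (work / time) / power = work per joule."""
    return work_units / (seconds * avg_watts)


def relative_change(new, old):
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old


# HYPOTHETICAL values, chosen only to illustrate the arithmetic.
hybrid = perf_per_watt(work_units=1.0, seconds=100.0, avg_watts=400.0)
gpu_controlled = perf_per_watt(work_units=1.0, seconds=105.0, avg_watts=350.0)

gain = relative_change(gpu_controlled, hybrid)
print(f"slowdown: {relative_change(105.0, 100.0):.1%}, perf/W gain: {gain:.1%}")
```

With these assumed inputs the GPU-controlled run is slower, yet it uses less total energy, so its performance per Watt comes out higher. Whether a real system behaves this way depends on how much power the freed CPU thread actually saves.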

High-resolution cameras and multi-camera systems are used in video surveillance applications such as security of public places, traffic monitoring, and military and satellite imaging. This creates a demand for algorithms that can process high-resolution video in real time. Motion detection and background separation play a vital role in capturing the object of interest in surveillance videos, but as camera resolution grows, the time complexity of these algorithms increases and they can no longer be used in real-time systems. Parallel architectures provide a superior platform for running such computationally complex algorithms efficiently. In this work, a method is proposed for accurately identifying moving objects in videos using adaptive background modelling, motion detection and object estimation. The pre-processing part includes an adaptive block background model and a dynamically adaptive thresholding technique to estimate the moving objects. The post-processing part includes an efficient parallel connected component labelling algorithm to estimate the objects of interest accurately. New parallel processing strategies are developed for each stage of the algorithm to reduce the time complexity of the system. On the GPU, the algorithm achieves an average speedup over CPU processing of 12.26 times for lower-resolution video frames (320×240, 720×480, 1024×768) and 7.30 times for higher-resolution video frames (1360×768, 1920×1080, 2560×1440). The algorithm was also tested with different numbers of threads per thread block, and the minimum execution time was achieved with a 16×16 thread block. Finally, the algorithm was tested on a night sequence in which the amount of light in the scene is very low, and it still delivered a significant speedup and accuracy in determining the object.
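To make the 16×16 thread block figure concrete, the sketch below computes how many blocks would be needed to cover each frame size listed above. It assumes one thread per pixel, which the abstract does not state. The helper names `grid_dims` and `idle_threads` are hypothetical, and the code only mirrors the usual ceiling-division launch arithmetic rather than the authors' implementation.

```python
BLOCK = (16, 16)  # threads per block (x, y); the size reported as fastest

RESOLUTIONS = [
    (320, 240), (720, 480), (1024, 768),       # lower-resolution group
    (1360, 768), (1920, 1080), (2560, 1440),   # higher-resolution group
]


def grid_dims(width, height, block=BLOCK):
    """Blocks needed along x and y, rounding up (ASSUMES one thread per pixel)."""
    bx, by = block
    return ((width + bx - 1) // bx, (height + by - 1) // by)


def idle_threads(width, height, block=BLOCK):
    """Threads launched beyond the frame edge that must be masked off."""
    gx, gy = grid_dims(width, height, block)
    return gx * gy * block[0] * block[1] - width * height


for w, h in RESOLUTIONS:
    gx, gy = grid_dims(w, h)
    print(f"{w}x{h}: grid {gx}x{gy}, idle threads {idle_threads(w, h)}")
```

Under this assumption, only 1920×1080 is not an exact multiple of 16 in both dimensions, so it is the one listed size where a kernel would need a bounds check to keep the extra threads from touching memory outside the frame.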


Articles published on Thread Block

Analyzing GPU-controlled communication with dynamic parallelism in terms of performance and energy

Fast implementation of block ciphers and PRNGs in Maxwell GPU architecture

A new approach for real time object detection and tracking on high resolution and multi-camera surveillance videos using GPU

Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU

Interval-based performance modeling for the all-pairs-shortest-path problem on GPUs

Speedup of Learning in Interval Type-2 Neural Fuzzy Systems Through Graphic Processing Units

Dynamic thread block launch

Wavelet-Based Classification of Hyperspectral Images Using Extended Morphological Profiles on Graphics Processing Units

A Parallel High Speed Lossless Data Compression Algorithm in Large-Scale Wireless Sensor Network

Correlation ratio based volume image registration on GPUs

An analytical GPU performance model for 3D stencil computations from the angle of data traffic

Phase quality map based on local multi-unwrapped results for two-dimensional phase unwrapping.

Improving GPU Memory Performance with Artificial Barrier Synchronization

Enhancement of membrane computing model implementation on GPU by introducing matrix representation for balancing occupancy and reducing inter-block communications

Parallel and distributed computing models on a graphics processing unit to accelerate simulation of membrane systems

Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures

Real-time brain extraction method from cerebral MRI volume based on graphic processing units

Improving cache locality for GPU-based volume rendering

Singe

Implementation of GPU virtualization using PCI pass-through mechanism
