CAVLCU: an efficient GPU-based implementation of CAVLC

Antonio Fuentes-Alventosa, R Medina-Carnicer, Juan Gómez-Luna, Nicolás Guil, José Maria González-Linares

doi:10.1007/s11227-021-04183-8


Open Access


Abstract

CAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy coding method for video and image compression. It is the most commonly used entropy coding method in the H.264 video standard. In recent years, several hardware accelerators for CAVLC have been designed. In contrast, high-performance software implementations of CAVLC (e.g., GPU-based ones) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can serve as the entropy coding component of GPU-based H.264 encoders, which are a very suitable solution when the H.264 hardware encoders built into GPUs lack certain necessary functionality, such as data encryption and information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems that encode images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, as they are non-separable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel, which avoids both the long-latency global memory accesses required to transmit intermediate results between different kernels and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks that process adjacent frame regions (in the horizontal and vertical dimensions), so that they can share results in global memory space. Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly to registers. Fourth, we use register tiling to implement the zigzag sorting, thus obtaining high instruction-level parallelism. (In this paper, to prevent confusion, a block of pixels of a frame is referred to simply as a block, and a GPU thread block as a thread-block.)
An exhaustive experimental evaluation showed that our approach is between 2.5 and 5.4 times faster than the only state-of-the-art GPU-based implementation of CAVLC.
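The fourth idea, register tiling for the zigzag sorting, can be illustrated with a minimal host-side C sketch. It assumes a 4 × 4 block stored in raster order and uses the standard H.264 frame zigzag scan; the function name zigzag4x4 is hypothetical, and this is not CAVLCU's CUDA code. Because every index is a compile-time constant and there is no loop or data-dependent address, a GPU compiler can keep the whole 16-coefficient tile in registers, and the 16 independent moves expose instruction-level parallelism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not CAVLCU's code): reorders a 4x4 block from
   raster order into the H.264 frame zigzag scan order. */
static void zigzag4x4(const int16_t raster[16], int16_t zz[16])
{
    assert(raster != NULL && zz != NULL);

    /* Load the whole 4x4 tile into scalars (registers on a GPU). */
    const int16_t r0 = raster[0],   r1 = raster[1],   r2 = raster[2],   r3 = raster[3];
    const int16_t r4 = raster[4],   r5 = raster[5],   r6 = raster[6],   r7 = raster[7];
    const int16_t r8 = raster[8],   r9 = raster[9],   r10 = raster[10], r11 = raster[11];
    const int16_t r12 = raster[12], r13 = raster[13], r14 = raster[14], r15 = raster[15];

    /* Straight-line permutation with constant indices: no loop and no
       data-dependent addressing, so the 16 moves are mutually independent.
       All loads happen before any store, so in-place use is also safe. */
    zz[0]  = r0;  zz[1]  = r1;  zz[2]  = r4;  zz[3]  = r8;
    zz[4]  = r5;  zz[5]  = r2;  zz[6]  = r3;  zz[7]  = r6;
    zz[8]  = r9;  zz[9]  = r12; zz[10] = r13; zz[11] = r10;
    zz[12] = r7;  zz[13] = r11; zz[14] = r14; zz[15] = r15;
}
```

For a block whose entries equal their raster indices, zigzag4x4 produces 0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15, which is the zigzag scan order itself. A loop over a lookup table would give the same result, but on a GPU a dynamically indexed per-thread array is typically placed in local memory rather than in registers.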

Highlights

In the current digital era, the massive use of multimedia data, such as images and videos, together with the need to overcome the limitations of storage space and communication bandwidth, has given data compression an essential role.
We present CAVLCU, an optimized implementation of Context-Adaptive Variable Length Coding (CAVLC) on GPU, developed in CUDA.
A highly optimized GPU-based approach to CAVLC implemented in CUDA.

Summary

CoeffToken

The magnitude of the nonzero coefficients tends to be larger at the start of the zigzag array, near the first coefficient, and smaller towards the higher frequencies. The VLC assigned to CoeffToken is obtained from a lookup table that, in the case of a 4 × 4 block, is chosen from three VLC tables and one 6-bit fixed-length code table, whose contents are specified in Table 9-5 of the H.264 standard [14]. The choice of lookup table depends on a parameter nC, which is calculated from the number of nonzero coefficients in the blocks to the left of and above the current block (parameters nA and nB, respectively), as follows:

Left and top blocks are available: nC = (nA + nB + 1) >> 1
Only the left block is available: nC = nA
Only the top block is available: nC = nB
Neither neighbouring block is available: nC = 0
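The CoeffToken table selection described above can be sketched in C as follows. This is a minimal illustration under simplifying assumptions, not CAVLCU's implementation: it handles only luma 4 × 4 blocks (nC ≥ 0; H.264 uses nC = -1 and -2 for chroma DC), treats neighbour availability as two flags supplied by the caller, and the function names are hypothetical. CoeffToken jointly encodes TotalCoeff (the number of nonzero coefficients) and TrailingOnes (up to three trailing ±1 values), and Table 9-5 uses its three VLC tables for 0 ≤ nC < 2, 2 ≤ nC < 4 and 4 ≤ nC < 8, and the 6-bit fixed-length code for nC ≥ 8.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the two symbols jointly coded by CoeffToken for a 4x4 block whose
   16 coefficients are already in zigzag order:
   TotalCoeff   = number of nonzero coefficients,
   TrailingOnes = number of consecutive +/-1 values at the high-frequency
                  end of the nonzero coefficients, capped at 3. */
static void coeff_token_symbols(const int16_t zz[16],
                                int *total_coeff, int *trailing_ones)
{
    int tc = 0, t1 = 0, still_counting = 1;
    assert(zz && total_coeff && trailing_ones);
    /* Scan from the highest frequency back towards the first coefficient;
       zeros are skipped and do not interrupt the run of trailing ones. */
    for (int i = 15; i >= 0; --i) {
        if (zz[i] == 0)
            continue;
        ++tc;
        if (still_counting && t1 < 3 && (zz[i] == 1 || zz[i] == -1))
            ++t1;
        else
            still_counting = 0;
    }
    *total_coeff = tc;
    *trailing_ones = t1;
}

/* nC from the TotalCoeff of the left (nA) and top (nB) blocks.
   Availability is passed in by the caller (simplified here). */
static int compute_nC(int left_available, int nA, int top_available, int nB)
{
    if (left_available && top_available)
        return (nA + nB + 1) >> 1; /* rounded average */
    if (left_available)
        return nA;
    if (top_available)
        return nB;
    return 0;
}

/* Maps nC (luma case, nC >= 0) to a CoeffToken table:
   0, 1, 2 = the three VLC tables, 3 = the 6-bit fixed-length code. */
static int coeff_token_table(int nC)
{
    assert(nC >= 0);
    if (nC < 2)
        return 0;
    if (nC < 4)
        return 1;
    if (nC < 8)
        return 2;
    return 3;
}
```

For example, the zigzag array 0, 3, 0, 1, -1, -1, 0, 1 followed by eight zeros gives TotalCoeff = 5 and TrailingOnes = 3 (the fourth nonzero value from the end is also 1, but the cap of three ends the run). If both neighbours are available with nA = 3 and nB = 4, then nC = (3 + 4 + 1) >> 1 = 4, which selects the third VLC table (index 2).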

Levels

CAVLC example

Two scans

Coding

Calculation of block and MB indexes

Coefficients reading

Zigzag sorting

Calculation of the symbols

Block encoding

Experimental evaluation

CAVLC applications

Conclusions


Journal: The Journal of Supercomputing
Publication Date: Nov 29, 2021
Citations: 1
License type: open-access
