In this work, we propose a flexible 1D DCT hardware design with a constant throughput of 32 pixels per cycle for all transform block (TB) size modes. The design supports all TB sizes defined in the HEVC standard. Flexibility is achieved by multiplexing only the outputs, which results in lower data path delays than in other proposed designs. In addition, the extra partial butterfly units are moved to the outputs of the transformation cores. The transformation is performed in a single stage to minimize the latency of the design and reduce hardware usage. The highly parallel input reduces the need for a very high operating frequency, which makes the design suitable for low-power FPGA implementations. Four different reusable transformation cores are used, each built from parallel Multiple Constant Multiplication (MCM) units to further reduce the calculation time. The design was implemented on a Virtex UltraScale+ device. The implementation uses 21,818 LUTs and reaches a maximum throughput of 4.90 Gsps at a working frequency of 153 MHz, which is enough to support video resolutions of up to 8192 × 4320 at 60 fps. Comparison with other works shows that a DCT FPGA implementation without DSPs can reach the performance of ASICs, with trade-offs in power consumption.
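The throughput figures quoted above can be checked with a short back-of-the-envelope calculation. The sketch below uses only the numbers from the abstract, plus one assumption that the abstract does not state: 4:2:0 chroma subsampling, i.e. 1.5 samples per pixel. The function names are illustrative only.

```python
# Rough throughput check using the figures quoted in the abstract.
SAMPLES_PER_CYCLE = 32   # constant throughput stated in the abstract
CLOCK_HZ = 153e6         # reported working frequency

# Assumption (not stated in the abstract): 4:2:0 chroma subsampling,
# i.e. one luma sample plus two quarter-resolution chroma samples per pixel.
SAMPLES_PER_PIXEL_420 = 1.5


def peak_throughput(samples_per_cycle, clock_hz):
    """Samples per second delivered at a given clock frequency."""
    return samples_per_cycle * clock_hz


def required_rate(width, height, fps, samples_per_pixel):
    """Samples per second needed for a given video format."""
    return width * height * fps * samples_per_pixel


available = peak_throughput(SAMPLES_PER_CYCLE, CLOCK_HZ)
needed = required_rate(8192, 4320, 60, SAMPLES_PER_PIXEL_420)

print(f"available: {available / 1e9:.2f} Gsps")
print(f"needed:    {needed / 1e9:.2f} Gsps")
print("sufficient" if available >= needed else "insufficient")
```

Under the 4:2:0 assumption, the required rate is about 3.19 Gsps, which sits below the 4.90 Gsps obtained from 32 samples per cycle at 153 MHz. A different chroma format would change the required rate accordingly.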
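To illustrate what a partial butterfly combined with multiplierless constant multiplication looks like, here is a minimal software sketch of the 4-point HEVC core transform. It is not the authors' 32-sample hardware architecture: the coefficients 64, 83 and 36 come from the HEVC 4-point transform matrix, while the particular shift-and-add decompositions and all function names are assumptions chosen for illustration. The scaling and rounding stage that follows the transform in a real codec is omitted.

```python
def mul64(v):
    """64 * v as a single shift."""
    return v << 6


def mul83(v):
    """83 * v as shifts and adds: 83 = 64 + 16 + 2 + 1."""
    return (v << 6) + (v << 4) + (v << 1) + v


def mul36(v):
    """36 * v as shifts and adds: 36 = 32 + 4."""
    return (v << 5) + (v << 2)


def dct4_partial_butterfly(x):
    """4-point forward transform using a butterfly stage and shift-add products."""
    x0, x1, x2, x3 = x
    # Butterfly stage: even part from sums, odd part from differences.
    e0, e1 = x0 + x3, x1 + x2
    o0, o1 = x0 - x3, x1 - x2
    return [
        mul64(e0 + e1),
        mul83(o0) + mul36(o1),
        mul64(e0 - e1),
        mul36(o0) - mul83(o1),
    ]


def dct4_reference(x):
    """Direct matrix-vector product, used only to check the butterfly version."""
    matrix = [
        [64, 64, 64, 64],
        [83, 36, -36, -83],
        [64, -64, -64, 64],
        [36, -83, 83, -36],
    ]
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


sample = [1, 2, 3, 4]
print(dct4_partial_butterfly(sample))
print(dct4_reference(sample))
```

Both functions return the same coefficients for any integer input. The butterfly version needs only additions, subtractions and shifts, which is the property that lets an MCM-based core avoid DSP multipliers.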