Abstract

H.264 is the newest video coding standard developed by the Joint Video Team (JVT). Compared with MPEG-4, H.263, and MPEG-2, H.264 reduces the bit-rate by 39%, 49%, and 64%, respectively. Because of its superior performance, H.264 has been widely adopted in commercial applications including digital TV broadcasting (European DVB-T and Japanese HDTV), next-generation DVD (Blu-ray Disc and HD DVD), and network streaming (Apple QuickTime). The coding efficiency improvement of H.264 comes at the price of huge computation and complexity. For our targeted specification (baseline profile level 4.1), more than 83 giga-instructions per second of computation and more than 70 gigabytes per second of bandwidth are required. Moreover, new functions such as advanced prediction schemes and the deblocking filter increase the complexity of the system. To fulfill the requirements of H.264 high-definition applications, an efficient system design is essential. Traditional video decoding hardware designs are mostly based on macroblock pipelining. However, if this traditional design methodology is directly adopted in an H.264 decoder, much on-chip memory is wasted. New features of the coding tools also make module-level design very challenging. For ultra-high-end applications, the entropy decoder becomes the throughput bottleneck, and straightforward parallel processing techniques cannot speed it up because of its context-based adaptive nature. Because of variable block sizes and quarter-pixel-precision motion vectors, the motion-compensated inter prediction module consumes more than three times the bandwidth of the previous standard, MPEG-4 SP. The frame-based deblocking operation seriously degrades hardware utilization, and deblocking filtering has to be supported in two directions (horizontal and vertical), leading to complex data flow and control. We propose a hybrid task pipelining system to address these crucial issues.
Balanced pipelining schedules and proper degrees of parallelism are used to deliver the huge and complex computation required. Block-level, macroblock-level, and macroblock/frame-level pipelining schedules are arranged for CAVLD/IQ/IT/INTRA_PRED, INTER_PRED, and DEBLOCK, respectively. As a result, the internal pipeline memory and the bandwidth consumption are significantly reduced. Moreover, efficient modules are provided. The entropy decoder unit decodes the bitstream into symbols without bubble cycles, so high decoding throughput is achieved, and the proposed CAVLD unit can be extended to higher parallelism with low area overhead because only the Level table and the Run table have to be modified. In the motion-compensated inter prediction unit, the proposed memory access schemes, Interpolation Window Reuse (IWR) and Interpolation Window Classification (IWC), save 60% of external memory bandwidth, and the proposed processing order of 4x4 blocks for inter prediction enables high utilization of the reuse buffer. The DEBLOCK unit breaks the frame-level deblocking operation into macroblock-level operations so that hardware utilization is greatly increased. Our proposed transpose array combined with a 1-D filter solves the complex data flow and control problem. A prototype chip is implemented using the Artisan standard CMOS cell library with TSMC 0.18 µm 1P6M technology. The total gate count is about 217K, synthesized at 120 MHz. It supports H.264/MPEG-4 AVC decoding in baseline profile level 4.1 with five reference frames. The maximum processing capability is 246K macroblocks per second, or 2048x1024 4:2:0 30 Hz video. In total, about 10 Kbytes of on-chip memory and 16 Mbytes of off-chip memory are required. The core size is 2.19x2.19 mm². The average power dissipation is 186.4 mW when operating at 120 MHz with a 1.8 V power supply. Compared with other H.264 decoder designs, the proposed design requires a lower gate count and less on-chip memory.
Therefore, it is a good choice for integration into high-definition video decoding applications. When the specification is scaled down to QCIF (176x144) 15 Hz video, our chip delivers real-time decoding at 725 kHz with a 1.8 V power supply and consumes only 1.18 mW. This low-power feature also makes our design suitable for mobile applications.
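
The throughput figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below is an illustration, not part of the original work: it assumes the standard 16x16 H.264 macroblock size, and the function names are hypothetical. It suggests that both operating points (2048x1024 at 30 Hz with a 120 MHz clock, and QCIF at 15 Hz with a 725 kHz clock) correspond to a budget of roughly 488 clock cycles per macroblock.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16x16 luma samples


def macroblocks_per_second(width, height, fps):
    """Macroblock rate for a given frame size and frame rate."""
    # Ceiling division, so partially covered macroblocks are counted.
    mbs_wide = -(-width // MB_SIZE)
    mbs_high = -(-height // MB_SIZE)
    return mbs_wide * mbs_high * fps


def cycles_per_macroblock(clock_hz, width, height, fps):
    """Clock cycles available per macroblock for real-time decoding."""
    return clock_hz / macroblocks_per_second(width, height, fps)


# High-definition operating point quoted in the abstract.
hd_rate = macroblocks_per_second(2048, 1024, 30)      # 245760, i.e. about 246K
hd_budget = cycles_per_macroblock(120e6, 2048, 1024, 30)

# Low-power QCIF operating point quoted in the abstract.
qcif_rate = macroblocks_per_second(176, 144, 15)      # 1485
qcif_budget = cycles_per_macroblock(725e3, 176, 144, 15)

print(hd_rate, round(hd_budget, 1))
print(qcif_rate, round(qcif_budget, 1))
```

Under these assumptions the two cycle budgets agree closely, which is consistent with the same macroblock pipeline simply being clocked more slowly for the QCIF case.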
