An approximate kernel for the discrete cosine transform (DCT) of length 4 is derived from the 4-point DCT defined by the High Efficiency Video Coding (HEVC) standard and used for the computation of DCTs and inverse DCTs (IDCTs) of power-of-two lengths. There are two reasons for considering the DCT of length 4 as the basic module. First, it allows computation of DCTs of lengths 4, 8, 16, and 32, as prescribed by HEVC. Second, the DCTs generated from the 4-point DCT not only involve lower complexity, but also offer better compression performance. Fully parallel and area-constrained architectures for the approximate DCT are presented, offering a flexible tradeoff between area and time complexities. In addition, a reconfigurable architecture is proposed in which an 8-point DCT unit can be used in place of a pair of 4-point DCTs. Using the same reconfiguration scheme, a 32-point DCT can be configured for parallel computation of two 16-point DCTs, four 8-point DCTs, or eight 4-point DCTs. The proposed reconfigurable design can support real-time coding of high-definition video sequences in the 8k ultrahigh-definition television format ($7680\times 4320$ at 30 frames/s). A unified forward and inverse transform architecture is also proposed, in which the hardware complexity is reduced by sharing hardware between the DCT and IDCT computations. The proposed approximation has nearly the same arithmetic complexity and hardware requirement as recently proposed related methods, but involves significantly less error energy and offers a better peak signal-to-noise ratio than the others when DCTs of length greater than 8 are used. A detailed comparison of the complexity, energy efficiency, and compression performance of different DCT approximation schemes for video coding is also presented. It is shown that the proposed approximation provides better compressed-image quality than other approximate DCTs.
The proposed method can perform HEVC-compliant video coding with marginal degradation of video quality and a slight increase in the bit rate, at a fraction of the computational complexity of the HEVC transform.
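
For orientation, the following minimal sketch shows the exact 4-point core transform matrix of the HEVC standard, which is the starting point for the approximation, together with the luma sample rate implied by the 8k format quoted above. It does not reproduce the approximate kernel, which is not stated in this text. The names `HEVC_DCT4`, `forward_dct4`, and `luma_sample_rate` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Exact 4-point core transform matrix of HEVC (an integer-scaled DCT-II).
# This is the reference transform, NOT the approximate kernel.
HEVC_DCT4 = np.array([
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
], dtype=np.int64)


def forward_dct4(x):
    """Apply the 4-point HEVC core transform to a length-4 vector.

    Hypothetical helper: the scaling and rounding stages of a real
    encoder are omitted here.
    """
    return HEVC_DCT4 @ np.asarray(x, dtype=np.int64)


def luma_sample_rate(width, height, fps):
    """Luma samples per second for a given frame size and frame rate."""
    return width * height * fps


if __name__ == "__main__":
    # The rows are mutually orthogonal, so the Gram matrix is diagonal.
    print(HEVC_DCT4 @ HEVC_DCT4.T)
    # A constant input maps to a single DC coefficient.
    print(forward_dct4([1, 1, 1, 1]))
    # Luma throughput required for 7680x4320 at 30 frames/s.
    print(luma_sample_rate(7680, 4320, 30))
```

The Gram matrix check shows why the 4-point matrix is a convenient basic module: its rows are exactly orthogonal with nearly equal norms, so only a common scaling separates it from an orthonormal DCT.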