Abstract

The discrete cosine transform (DCT) is one of the major operations in image compression standards, and it requires intensive and complex computations. Recent computer systems and handheld devices are equipped with high-performance computing devices, such as a general-purpose graphics processing unit (GPGPU), in addition to the traditional multicore CPU. We develop an optimized parallel implementation of the forward DCT algorithm for JPEG image compression using the recently proposed Open Computing Language (OpenCL). This OpenCL parallel implementation combines a multicore CPU and a GPGPU in a single solution to perform DCT computations efficiently, applying optimization techniques that reduce kernel execution time and the cost of data movement. Separate optimized OpenCL kernels were developed for the CPU and for the GPU, each based on appropriate device-specific optimization factors such as thread mapping, thread granularity, vector-based memory access, and the given workload. The performance of the DCT is evaluated in a heterogeneous environment, and our OpenCL parallel implementation speeds up DCT execution by factors of 3.68 and 5.58 for different image sizes and formats, with respect to workload allocation and data transfer mechanisms. The obtained speedups indicate the scalability of the DCT performance.
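For orientation, the computation being parallelized can be sketched as a plain sequential reference. The block below is a minimal illustration, not the paper's OpenCL kernels: `dct_8x8` applies the standard 8x8 forward DCT-II used by baseline JPEG to one block (without the level shift that JPEG applies beforehand), and `split_blocks` is a hypothetical helper showing one simple way a list of independent blocks could be divided between a GPU and a CPU. The function names and the static `gpu_share` parameter are assumptions made for this example; the abstract does not state how the workload is actually allocated.

```python
import math

N = 8  # JPEG operates on 8x8 blocks


def dct_8x8(block):
    """Forward 2-D DCT-II of one 8x8 block (a list of 8 rows of 8 numbers)."""
    def c(k):
        # Normalization factor: 1/sqrt(2) for the zero frequency, 1 otherwise.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    )
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def split_blocks(num_blocks, gpu_share):
    """Hypothetical static split: the first blocks go to the GPU, the rest to the CPU.

    gpu_share is an assumed fraction in [0, 1]; it is not a value from the paper.
    """
    gpu_count = round(num_blocks * gpu_share)
    return range(0, gpu_count), range(gpu_count, num_blocks)
```

Each 8x8 block is transformed independently of the others, which is what makes it possible to hand different blocks to different devices. A constant block, for example `dct_8x8([[8] * 8 for _ in range(8)])`, yields a single nonzero DC coefficient in `out[0][0]`, with all other coefficients at or near zero.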
