Abstract

An increasing volume of data puts MapReduce data analytic platforms such as Hadoop under constant resource pressure. A new two-phase text compression scheme, implemented for Hadoop, has been designed specifically to accelerate data analysis and reduce cluster resource usage. The scheme consists of two levels of compression. The first-level compression allows a Hadoop program to consume the compressed data directly, reducing the data transmission cost within the cluster during analysis. The second level packages the data into fixed-size blocks that respect the logical data records. This further reduces the data to a size comparable to that achieved by a higher-order entropy encoder, while also making the compressed data splittable for HDFS. Provided utility functions make the use of the compression scheme transparent to Hadoop developers. The compression scheme is evaluated using a set of standard MapReduce jobs on a selection of real-world datasets. The experimental results show an improvement in analysis performance of up to 72% and compression ratios close to those achieved by a standard compressor such as Bzip.
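The abstract does not spell out the packaging algorithm, so the following is only a minimal illustrative sketch of the general idea behind the second level: compressed records are packed into fixed-size blocks and padding is inserted so that no record straddles a block boundary, which lets an HDFS input split start reading at any block offset. The class and method names (FixedSizeBlockPacker, addRecord, pad) are hypothetical and not taken from the paper.

import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch, not the paper's implementation: packs already
// first-level-compressed records into fixed-size blocks so that every
// block starts on a record boundary, which is what makes the packaged
// file splittable for HDFS.
public class FixedSizeBlockPacker {

    private final int blockSize;          // e.g. chosen to divide the HDFS block size
    private final OutputStream out;
    private int bytesInCurrentBlock = 0;

    public FixedSizeBlockPacker(OutputStream out, int blockSize) {
        this.out = out;
        this.blockSize = blockSize;
    }

    // Appends one compressed record, padding to the next block boundary
    // if the record would not fit in the current block.
    public void addRecord(byte[] record) throws IOException {
        if (record.length > blockSize) {
            throw new IllegalArgumentException("record larger than block size");
        }
        if (bytesInCurrentBlock + record.length > blockSize) {
            pad();                        // next record starts a fresh block
        }
        out.write(record);
        bytesInCurrentBlock += record.length;
    }

    // Fills the remainder of the current block with zero bytes
    // (a simple filler choice for this sketch).
    private void pad() throws IOException {
        int remaining = blockSize - bytesInCurrentBlock;
        out.write(new byte[remaining]);
        bytesInCurrentBlock = 0;
    }

    public void close() throws IOException {
        if (bytesInCurrentBlock > 0) {
            pad();
        }
        out.close();
    }
}

Under this assumption, a record reader for an input split would seek to the first block boundary at or after the split's start offset and decode records block by block, since record boundaries are guaranteed to align with block boundaries.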
