Abstract

An increasing volume of data puts MapReduce data analytic platforms such as Hadoop under constant resource pressure. A new two-phase text compression scheme, implemented for Hadoop, has been designed specifically to accelerate data analysis and reduce cluster resource usage. The scheme consists of two levels of compression. The first-level compression allows a Hadoop program to consume the compressed data directly, reducing the data transmission cost within the cluster during analysis. The second level packages the data into fixed-size blocks that respect the logical data records. This further reduces the data to a size comparable to that achieved by a higher-order entropy encoder, while also making the compressed data splittable for HDFS. Provided utility functions make the use of the compression scheme transparent to Hadoop developers. The compression scheme is evaluated using a set of standard MapReduce jobs on a selection of real-world datasets. The experimental results show an improvement in analysis performance of up to 72% and compression ratios close to those achieved by a standard compressor such as Bzip.
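The abstract does not spell out the packaging algorithm, so the following is only a minimal illustrative sketch of the general idea behind the second level: compressed records are packed into fixed-size blocks and padding is inserted so that no record straddles a block boundary, which lets an HDFS input split start reading at any block offset. The class and method names (FixedSizeBlockPacker, addRecord, pad) are hypothetical and not taken from the paper.

import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch, not the paper's implementation: packs already
// first-level-compressed records into fixed-size blocks so that every
// block starts on a record boundary, which is what makes the packaged
// file splittable for HDFS.
public class FixedSizeBlockPacker {

    private final int blockSize;          // e.g. chosen to divide the HDFS block size
    private final OutputStream out;
    private int bytesInCurrentBlock = 0;

    public FixedSizeBlockPacker(OutputStream out, int blockSize) {
        this.out = out;
        this.blockSize = blockSize;
    }

    // Appends one compressed record, padding to the next block boundary
    // if the record would not fit in the current block.
    public void addRecord(byte[] record) throws IOException {
        if (record.length > blockSize) {
            throw new IllegalArgumentException("record larger than block size");
        }
        if (bytesInCurrentBlock + record.length > blockSize) {
            pad();                        // next record starts a fresh block
        }
        out.write(record);
        bytesInCurrentBlock += record.length;
    }

    // Fills the remainder of the current block with zero bytes
    // (a simple filler choice for this sketch).
    private void pad() throws IOException {
        int remaining = blockSize - bytesInCurrentBlock;
        out.write(new byte[remaining]);
        bytesInCurrentBlock = 0;
    }

    public void close() throws IOException {
        if (bytesInCurrentBlock > 0) {
            pad();
        }
        out.close();
    }
}

Under this assumption, a record reader for an input split would seek to the first block boundary at or after the split's start offset and decode records block by block, since record boundaries are guaranteed to align with block boundaries.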
