Abstract

The increasing amount and size of data handled by data analytics applications running on Hadoop have created a need for faster data processing. One effective method for handling large data sizes is compression. Data compression not only makes network I/O processing faster, but also provides better utilization of resources. However, this approach defeats one of Hadoop’s main purposes: the parallelism of map and reduce tasks. The number of map tasks created is determined by the size of the file, so compressing a large file reduces the number of mappers, which in turn decreases parallelism. Consequently, standard Hadoop takes longer to process compressed data. In this paper, we propose the design and implementation of a Parallel Compressed File Decompressor (P-Codec) that improves the performance of Hadoop when processing compressed data. P-Codec includes two modules: the first decompresses data as it is received by a data node during the upload of the data to the Hadoop Distributed File System (HDFS). This reduces the runtime of a job by removing the burden of decompression during the MapReduce phase. The second P-Codec module is a decompressed map task divider that increases parallelism by dynamically adjusting the map task split sizes based on the size of the final decompressed block. Our experimental results using five different MapReduce benchmarks show an average improvement of approximately 80% compared to standard Hadoop.
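
As an illustrative sketch only, the idea behind the first module can be approximated on the client side with Hadoop's standard codec API: detect the codec from the file extension and stream decompressed bytes into HDFS, so that the stored blocks are already uncompressed. The class name and example paths below are hypothetical, and P-Codec itself operates in the data node upload path rather than through a client-side copy like this.

    // Illustrative sketch: decompress a compressed local file while copying it
    // into HDFS, so later MapReduce jobs read plain, splittable data.
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class DecompressOnUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path src = new Path(args[0]);   // e.g. file:///data/input.txt.gz (hypothetical)
            Path dst = new Path(args[1]);   // e.g. hdfs:///user/demo/input.txt (hypothetical)

            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem dstFs = dst.getFileSystem(conf);

            // Infer the codec from the file extension (.gz, .bz2, ...); null means uncompressed.
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(src);

            try (InputStream raw = srcFs.open(src);
                 InputStream in = (codec == null) ? raw : codec.createInputStream(raw);
                 OutputStream out = dstFs.create(dst)) {
                // Stream decompressed bytes into HDFS; HDFS splits them into blocks as usual.
                IOUtils.copyBytes(in, out, conf, false);
            }
        }
    }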

Highlights

  • Today’s flood of data is being generated at the rate of several Terabytes (TB) or even Petabytes (PB) every hour [1]

  • The physical file split is performed on the file when it is moved from the local machine to the Hadoop Distributed File System (HDFS)

  • To reduce the network hops and to follow the rack awareness scheme, Hadoop YARN executes the map tasks on nodes where the block resides [15]. With this design of multiple datanodes working on separate chunks of the file, Hadoop achieves a high degree of parallelism
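
For context on how the split size drives this parallelism, the sketch below (hypothetical class name, assuming the org.apache.hadoop.mapreduce API) shows how stock Hadoop lets a job bound its split sizes; P-Codec's second module instead derives these bounds dynamically from the size of the decompressed block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Bound each input split: with a 1 GB uncompressed input and a
            // 128 MB maximum, YARN can schedule roughly eight map tasks in parallel.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Mapper, reducer, and output settings would follow as in any MapReduce job.
        }
    }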

Summary

Introduction

Today’s flood of data is being generated at the rate of several Terabytes (TB) or even Petabytes (PB) every hour [1]. Several systems have addressed the processing of compressed data. Vertica is a database management system whose compression engine can process compressed data without decompressing it. The Parallel Database Management System (PDMS) [8] has incorporated multiple complex designs for processing and compressing data [9]. Other complex designs include parallelizing compression while data is sent over the network [10] and decompressing data while it is retrieved by the system. Hadoop’s processing of compressed data is slowed by many factors. These factors include the overhead of data decompression during the job runtime and the decreased number of map tasks, which reduces parallelism in Hadoop. We introduce the Parallel Compressed File Decompressor (P-Codec) to overcome these two factors and to speed up Hadoop’s current processing of compressed data.
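
As background on the second factor, in stock Hadoop a file compressed with a non-splittable codec such as gzip is handed to a single map task regardless of its size. The sketch below (hypothetical class name, standard Hadoop classes only) checks the splittability of the codec inferred for an input path; it is illustrative context, not part of P-Codec.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);   // e.g. hdfs:///data/logs.gz (hypothetical)
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);
            if (codec == null) {
                System.out.println("Uncompressed: splits follow the HDFS block size.");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(codec.getClass().getSimpleName() + " is splittable.");
            } else {
                System.out.println(codec.getClass().getSimpleName()
                        + " is not splittable: the whole file becomes one map task.");
            }
        }
    }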

Background and Related Works
Problem Description
P-Codec Design and Implementation
Experimental Results
Conclusion