Abstract

The increasing amount and size of data handled by data analytics applications running on Hadoop have created a need for faster data processing. One effective method for handling large data sizes is compression. Data compression not only makes network I/O processing faster, but also provides better utilization of resources. However, this approach defeats one of Hadoop’s main purposes: the parallelism of map and reduce tasks. The number of map tasks created is determined by the size of the file, so compressing a large file reduces the number of mappers, which in turn decreases parallelism. Consequently, standard Hadoop takes longer to process compressed data. In this paper, we propose the design and implementation of a Parallel Compressed File Decompressor (P-Codec) that improves the performance of Hadoop when processing compressed data. P-Codec includes two modules: the first decompresses data as it is received by a data node during the upload of the data to the Hadoop Distributed File System (HDFS). This reduces the runtime of a job by removing the burden of decompression during the MapReduce phase. The second P-Codec module is a decompressed map task divider that increases parallelism by dynamically adjusting the map task split sizes based on the size of the final decompressed block. Our experimental results using five different MapReduce benchmarks show an average improvement of approximately 80% compared to standard Hadoop.
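
As an illustrative sketch only, the idea behind the first module can be approximated on the client side with Hadoop's standard codec API: detect the codec from the file extension and stream decompressed bytes into HDFS, so that the stored blocks are already uncompressed. The class name and example paths below are hypothetical, and P-Codec itself operates in the data node upload path rather than through a client-side copy like this.

    // Illustrative sketch: decompress a compressed local file while copying it
    // into HDFS, so later MapReduce jobs read plain, splittable data.
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class DecompressOnUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path src = new Path(args[0]);   // e.g. file:///data/input.txt.gz (hypothetical)
            Path dst = new Path(args[1]);   // e.g. hdfs:///user/demo/input.txt (hypothetical)

            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem dstFs = dst.getFileSystem(conf);

            // Infer the codec from the file extension (.gz, .bz2, ...); null means uncompressed.
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(src);

            try (InputStream raw = srcFs.open(src);
                 InputStream in = (codec == null) ? raw : codec.createInputStream(raw);
                 OutputStream out = dstFs.create(dst)) {
                // Stream decompressed bytes into HDFS; HDFS splits them into blocks as usual.
                IOUtils.copyBytes(in, out, conf, false);
            }
        }
    }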

Highlights

  • Today’s flood of data is being generated at the rate of several Terabytes (TB) or even Petabytes (PB) every hour [1]

  • The physical file split is performed on the file when it is moved from the local machine to the Hadoop Distributed File System (HDFS)

  • To reduce the network hops and to follow the rack awareness scheme, Hadoop YARN executes the map tasks on nodes where the block resides [15]. With this design of multiple datanodes working on separate chunks of the file, Hadoop achieves a high degree of parallelism
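
For context on how the split size drives this parallelism, the sketch below (hypothetical class name, assuming the org.apache.hadoop.mapreduce API) shows how stock Hadoop lets a job bound its split sizes; P-Codec's second module instead derives these bounds dynamically from the size of the decompressed block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Bound each input split: with a 1 GB uncompressed input and a
            // 128 MB maximum, YARN can schedule roughly eight map tasks in parallel.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Mapper, reducer, and output settings would follow as in any MapReduce job.
        }
    }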

Summary

Introduction

Today’s flood of data is being generated at the rate of several Terabytes (TB) or even Petabytes (PB) every hour [1]. Several systems have addressed the processing of compressed data. Vertica is a database management system whose compression engine can process compressed data without decompressing it. The Parallel Database Management System (PDMS) [8] has incorporated multiple complex designs for processing and compressing data [9]. Other complex designs include parallelizing compression while data is sent over the network [10] and decompressing data while it is retrieved by the system. Hadoop’s processing of compressed data is slowed by many factors. These factors include the overhead of data decompression during the job runtime and the decreased number of map tasks, which reduces parallelism in Hadoop. We introduce the Parallel Compressed File Decompressor (P-Codec) to overcome these two factors and to speed up Hadoop’s current processing of compressed data.
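
As background on the second factor, in stock Hadoop a file compressed with a non-splittable codec such as gzip is handed to a single map task regardless of its size. The sketch below (hypothetical class name, standard Hadoop classes only) checks the splittability of the codec inferred for an input path; it is illustrative context, not part of P-Codec.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);   // e.g. hdfs:///data/logs.gz (hypothetical)
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);
            if (codec == null) {
                System.out.println("Uncompressed: splits follow the HDFS block size.");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(codec.getClass().getSimpleName() + " is splittable.");
            } else {
                System.out.println(codec.getClass().getSimpleName()
                        + " is not splittable: the whole file becomes one map task.");
            }
        }
    }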

Background and Related Works
Problem Description
P-Codec Design and Implementation
Experimental Results
Conclusion