Abstract

Four major design factors of HDFS, namely the block size, the number of data nodes, the number of client processes, and the replication factor, are investigated to determine their effects on the I/O performance of HDFS through experiments on a real physical HDFS infrastructure consisting of 64 Hadoop data nodes built on Intel i9 based blades. The block size is observed to be optimal when it is about 1 Gb, i.e., roughly 128 MB, which is the amount of data the hard disk drive can effectively read or write in one second on most of today's off-the-shelf computers. A sophisticated allocation strategy is required to determine the number of mappers and reducers as the number of data nodes increases, because the overall performance is influenced in a complicated manner by the number of raw data blocks the job must process, the per-node block processing time, and the overhead of shuffling. The experiments show that Hadoop distributes the work properly, so increasing the number of clients does not have a significant impact on performance. There is little delay in copying replicas, because replication is done in a pipelined manner even when the network is overloaded.
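
As a concrete illustration of two of these factors, the sketch below shows how the block size and replication factor can be set with Hadoop's Java FileSystem API; the path, buffer size, and payload are illustrative assumptions, not values from the experiments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeAndReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size: 128 MB, roughly one second of
        // sequential I/O on a commodity hard disk (see the abstract).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Both parameters can also be chosen per file at create time.
        // Path, buffer size, and payload are illustrative assumptions.
        Path out = new Path("/benchmark/input-0.dat");
        short replication = 3;                    // HDFS default
        long blockSize = 128L * 1024 * 1024;      // 128 MB
        int bufferSize = 4096;
        try (FSDataOutputStream stream =
                 fs.create(out, true, bufferSize, replication, blockSize)) {
            stream.writeBytes("sample payload\n");
        }
    }
}
```

The pipelined replication noted above means the client streams each packet to the first data node, which forwards it to the next in the pipeline, so additional replicas add little latency at the client.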

Highlights

  • The Hadoop Distributed File System (HDFS) and the other software systems that make up the Hadoop Ecosystem are becoming increasingly valuable as an operating system for processing Big Data (OPDi, 2017; Big Data, 2016; Chaudhari et al., 2018)

  • Many studies have explored the I/O performance of HDFS (Park, 2016; Shankar and Lin, 2017; Dev and Patgiri, 2014; Clemente-Castillo et al., 2018), but few experimental studies have examined the optimal design factors of an HDFS system, such as the block size and the minimum number of data nodes required

  • We find that the Hadoop block size should increase in proportion to the effective data transfer bandwidth of the storage device that contains the block to achieve high I/O performance (a sizing sketch follows this list)
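
A minimal sketch of that sizing rule is shown below, assuming the convention that dfs.blocksize is a power of two; the 130 MB/s figure is an assumed example of effective sequential throughput, not a measurement from the paper.

```java
public class BlockSizing {
    /** Round a byte count up to the nearest power of two, since HDFS
     *  block sizes are conventionally powers of two (64 MB, 128 MB, ...). */
    static long roundUpToPowerOfTwo(long bytes) {
        long size = 1;
        while (size < bytes) size <<= 1;
        return size;
    }

    public static void main(String[] args) {
        // Assumed, not measured here: ~130 MB/s effective sequential
        // throughput for a commodity hard disk.
        long effectiveBandwidthBytesPerSec = 130L * 1000 * 1000;
        // The sizing rule: one second of effective I/O per block.
        long oneSecondOfIo = effectiveBandwidthBytesPerSec;
        long blockSize = roundUpToPowerOfTwo(oneSecondOfIo);
        System.out.printf("suggested dfs.blocksize = %d bytes (%d MB)%n",
                blockSize, blockSize / (1024 * 1024));
    }
}
```

Running this prints "suggested dfs.blocksize = 134217728 bytes (128 MB)", matching the 128 MB optimum reported in the abstract.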



Introduction

The Hadoop Distributed File System (HDFS) and other software systems that make up the Hadoop Ecosystem are becoming increasingly valuable as an operating system for processing Big Data (OPDi, 2017; Big Data, 2016; Chaudhari et al., 2018). Although HDFS, the representative distributed file system of the Apache open source project Hadoop (Apache, 2016), gains its performance advantage through parallel processing, efficient data distribution control is the most important design element of a distributed file system: it must cope with hardware failures flexibly and ensure adequate performance through proper resource management. Hadoop processing involves hashing for mappers and shuffling and sorting for reducers, as well as inputting and outputting HDFS data blocks. Reducing the time for inputting the blocks usually increases the network traffic for shuffling among the nodes, which in turn increases the shuffle time.
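
To make this pipeline concrete, here is a minimal WordCount-style mapper and reducer written against the standard Hadoop MapReduce Java API; it is an illustrative sketch, not the benchmark workload used in the experiments.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleSketch {

    // Each map task typically consumes one HDFS block; its output keys
    // are hash-partitioned across reducers, producing the shuffle
    // traffic discussed above.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);   // emitted pairs are shuffled
            }
        }
    }

    // Reducers receive their partition of the map output after the
    // framework has shuffled and merge-sorted it by key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts,
                              Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            sum.set(total);
            ctx.write(key, sum);
        }
    }
}
```

Each map task typically reads one HDFS block, so a larger block size means fewer, longer map tasks; the pairs emitted by the mapper are what gets hash-partitioned, shuffled across the network, and sorted before reaching the reducers.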


