Abstract

Four major design factors of HDFS, namely the block size, the number of data nodes, the number of client processes, and the replication factor, are investigated to determine their effects on the I/O performance of HDFS through experiments on a real physical HDFS infrastructure consisting of 64 Hadoop data nodes built on Intel i9 based blades. The block size is observed to be optimal when it is about 1 Gb, i.e., roughly 128 MB, which is the amount of data the hard disk drive can effectively read or write in one second on most of today's off-the-shelf computers. A sophisticated allocation strategy is required to determine the number of mappers and reducers as the number of data nodes increases, because the overall performance is influenced in a complicated manner by the number of raw data blocks the job must process, the per-node block processing time, and the overhead of shuffling. The experiments show that Hadoop distributes the work properly, so increasing the number of clients does not have a significant impact on performance. There is little delay in copying replicas, because replication is done in a pipelined manner even when the network is overloaded.
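
As a concrete illustration of two of these factors, the sketch below shows how the block size and replication factor can be set with Hadoop's Java FileSystem API; the path, buffer size, and payload are illustrative assumptions, not values from the experiments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeAndReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size: 128 MB, roughly one second of
        // sequential I/O on a commodity hard disk (see the abstract).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Both parameters can also be chosen per file at create time.
        // Path, buffer size, and payload are illustrative assumptions.
        Path out = new Path("/benchmark/input-0.dat");
        short replication = 3;                    // HDFS default
        long blockSize = 128L * 1024 * 1024;      // 128 MB
        int bufferSize = 4096;
        try (FSDataOutputStream stream =
                 fs.create(out, true, bufferSize, replication, blockSize)) {
            stream.writeBytes("sample payload\n");
        }
    }
}
```

The pipelined replication noted above means the client streams each packet to the first data node, which forwards it to the next in the pipeline, so additional replicas add little latency at the client.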

Highlights

  • The Hadoop Distributed File System (HDFS) and the other software systems that make up the Hadoop Ecosystem are becoming increasingly valuable as an operating system for processing Big Data (OPDi, 2017; Big Data, 2016; Chaudhari et al., 2018)

  • Many studies have explored the I/O performance of HDFS (Park, 2016; Shankar and Lin, 2017; Dev and Patgiri, 2014; Clemente-Castillo et al., 2018), but few experimental studies have examined the optimal design factors of an HDFS system, such as the block size and the minimum number of data nodes required

  • We find that the Hadoop block size should increase in proportion to the effective data transfer bandwidth of the storage device that contains the block to achieve high I/O performance (a sizing sketch follows this list)
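
A minimal sketch of that sizing rule is shown below, assuming the convention that dfs.blocksize is a power of two; the 130 MB/s figure is an assumed example of effective sequential throughput, not a measurement from the paper.

```java
public class BlockSizing {
    /** Round a byte count up to the nearest power of two, since HDFS
     *  block sizes are conventionally powers of two (64 MB, 128 MB, ...). */
    static long roundUpToPowerOfTwo(long bytes) {
        long size = 1;
        while (size < bytes) size <<= 1;
        return size;
    }

    public static void main(String[] args) {
        // Assumed, not measured here: ~130 MB/s effective sequential
        // throughput for a commodity hard disk.
        long effectiveBandwidthBytesPerSec = 130L * 1000 * 1000;
        // The sizing rule: one second of effective I/O per block.
        long oneSecondOfIo = effectiveBandwidthBytesPerSec;
        long blockSize = roundUpToPowerOfTwo(oneSecondOfIo);
        System.out.printf("suggested dfs.blocksize = %d bytes (%d MB)%n",
                blockSize, blockSize / (1024 * 1024));
    }
}
```

Running this prints "suggested dfs.blocksize = 134217728 bytes (128 MB)", matching the 128 MB optimum reported in the abstract.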



Introduction

The Hadoop Distributed File System (HDFS) and other software systems that make up the Hadoop Ecosystem are becoming increasingly valuable as an operating system for processing Big Data (OPDi, 2017; Big Data, 2016; Chaudhari et al., 2018). Although HDFS, the representative distributed file system of the Apache open source project Hadoop (Apache, 2016), gains its performance advantage through parallel processing, efficient data distribution control is the most important design element of a distributed file system: it must cope with hardware failures flexibly and ensure adequate performance through proper resource management. Hadoop processing involves hashing for mappers and shuffling and sorting for reducers, as well as inputting and outputting HDFS data blocks. Reducing the time for inputting the blocks usually increases the network traffic for shuffling among the nodes, which in turn increases the shuffle time.
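
To make this pipeline concrete, here is a minimal WordCount-style mapper and reducer written against the standard Hadoop MapReduce Java API; it is an illustrative sketch, not the benchmark workload used in the experiments.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleSketch {

    // Each map task typically consumes one HDFS block; its output keys
    // are hash-partitioned across reducers, producing the shuffle
    // traffic discussed above.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);   // emitted pairs are shuffled
            }
        }
    }

    // Reducers receive their partition of the map output after the
    // framework has shuffled and merge-sorted it by key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts,
                              Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            sum.set(total);
            ctx.write(key, sum);
        }
    }
}
```

Each map task typically reads one HDFS block, so a larger block size means fewer, longer map tasks; the pairs emitted by the mapper are what gets hash-partitioned, shuffled across the network, and sorted before reaching the reducers.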


