Hadoop Scalability and Performance Testing in Homogeneous Clusters

Chiranjeevi Manike, Tejashwini Gajulagudem, Ashok Kumar Nanda

doi:10.1007/978-3-030-30577-2_81

Abstract

Big data refers to datasets that are too large (e.g., GBs, TBs, PBs, ZBs) or too complex for traditional data processing software. Distributed and parallel processing is therefore increasingly important for big data. The two most popular parallel and distributed processing frameworks are Hadoop and Spark, both open-source software frameworks for reliable, scalable, distributed computing. Hadoop is developed under the Apache Software Foundation. It allows extremely large datasets to be processed on clusters of computers using a simple programming model called MapReduce. It stores data in HDFS (Hadoop Distributed File System), a distributed file system designed to run on commodity hardware. Hadoop is designed to scale horizontally from a single machine to thousands of machines, each offering local computation and storage. The performance of a Hadoop cluster depends on the application and on several parameters. In this paper we study the performance of a homogeneous Hadoop cluster by tuning a few of these parameters, such as cluster size, dataset size, and HDFS block size.
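
As a rough illustration of why HDFS block size interacts with dataset size, the sketch below estimates how many map tasks a job would launch under the simplifying assumption of one input split per HDFS block. The function name `estimated_map_tasks` and the 10 GB dataset are hypothetical and are not taken from the paper; real split counts also depend on the input format, the number of files, and split-size settings.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimated_map_tasks(dataset_bytes: int, block_bytes: int) -> int:
    """Estimate map tasks, assuming one input split per HDFS block.

    This is an approximation for a single large splittable file; it is
    not how Hadoop itself computes splits in every configuration.
    """
    if dataset_bytes < 0 or block_bytes <= 0:
        raise ValueError("dataset size must be >= 0 and block size must be > 0")
    # Integer ceiling division: a partially filled last block still needs a task.
    return -(-dataset_bytes // block_bytes)


if __name__ == "__main__":
    dataset = 10 * GB  # hypothetical dataset size
    for block_mb in (64, 128, 256):
        tasks = estimated_map_tasks(dataset, block_mb * MB)
        print(f"block size {block_mb} MB: about {tasks} map tasks")
```

Under this assumption, doubling the block size halves the number of map tasks, which trades per-task scheduling overhead against the degree of parallelism available to the cluster.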

Full Text