Abstract

Data analysis has become challenging in recent years as the volume of generated data has grown difficult to manage; more hardware and software resources are therefore needed to store and process this huge amount of data. Apache Hadoop is a free framework, widely used thanks to the Hadoop Distributed File System (HDFS) and its ability to integrate with other data processing and analysis components, such as MapReduce for data processing, Spark for in-memory data processing, Apache Drill for SQL on Hadoop, and many others. In this paper, we analyze the Hadoop framework implementation through a comparative study of Single-node and Multi-node Hadoop clusters. We explain in detail the two layers at the base of the Hadoop architecture: the HDFS layer, with its NameNode, Secondary NameNode, and DataNode daemons, and the MapReduce layer, with its JobTracker and TaskTracker daemons. This work is part of a larger effort aiming to perform data processing in Data Lake structures.

Highlights

  • Before the term Big Data appeared, about 15 years ago, there were few possibilities to process data sets of terabytes or more

  • After installing and configuring a Single-node cluster and starting its ssh processes, launching the jps command (a Java Virtual Machine process status tool) shows the status of all Hadoop daemons currently running on the machine, such as the NameNode, Secondary NameNode, JobTracker, TaskTracker, and DataNode

  • All Hadoop daemons (NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker) run on one single machine
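The single-node check described above can be sketched as a short shell session. Script names and paths are illustrative and vary with the Hadoop version; the JobTracker and TaskTracker daemons listed here belong to the Hadoop 1.x MapReduce layer, where start-all.sh launches both the HDFS and MapReduce daemons:

```shell
# Start all Hadoop daemons on the single-node cluster
# (assumes HADOOP_HOME points at a Hadoop 1.x installation)
$HADOOP_HOME/bin/start-all.sh

# List the running JVM processes; on a healthy single-node cluster
# all five Hadoop daemons appear alongside the jps process itself
jps
```

On a correctly configured machine, the jps listing is expected to include NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker; a missing daemon usually points to a configuration or ssh problem.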


Summary

INTRODUCTION

Before the term Big Data appeared, about 15 years ago, there were few possibilities to process data sets of terabytes or more. Doug Cutting had begun working on a new open-source implementation based on the ideas suggested by Google, and so Hadoop was born. Hadoop is a distributed processing software framework that can process both small and large volumes of data across clusters of computers. It is recommended for large data sets because it can scale from a single server to hundreds. Data are read in parallel, so the time required for this operation is substantially reduced. Another important feature of Hadoop is that it is based on the "write once, read many times" model.
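The "write once, read many times" model can be illustrated with the HDFS file system shell: a file is written into HDFS once and then read repeatedly by any number of jobs, but existing file contents are not modified in place. The file and directory names below are hypothetical, and the hadoop fs command assumes a running HDFS installation:

```shell
# Write once: copy a local data set into HDFS
# (paths are illustrative examples, not from the paper)
hadoop fs -put access_log.txt /data/logs/access_log.txt

# Read many times: the same file can now be read in parallel
# by any number of clients or MapReduce jobs
hadoop fs -cat /data/logs/access_log.txt | head

# List the directory to confirm the file is stored in HDFS
hadoop fs -ls /data/logs
```

Because HDFS files are immutable once written, updating a data set typically means writing a new file rather than editing the old one, which is what allows many readers to scan the data in parallel safely.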

RELATED WORKS
HADOOP ARCHITECTURE
HADOOP
Single-Node Cluster
Multi-Node Cluster
CONCLUSIONS

