Analysis of Big Data Storage Tools for Data Lakes based on Apache Hadoop Platform

Vladimir Belov, Evgeny Nikulchev

doi:10.14569/ijacsa.2021.0120864

Abstract

When developing large-scale data processing systems, the question of data storage arises. One modern approach to this problem is the data lake. Many data lake implementations use Apache Hadoop as the underlying platform. Hadoop has no default data storage format, so a format must be chosen when designing a data processing system. This choice should be based on an assessment against several criteria. However, experimental evaluation alone does not always give a complete understanding of what a particular data storage format can do. It is therefore also necessary to study the features of the format, its internal structure, recommendations for its use, and so on. The article describes the features of both widely used data storage formats and formats that are currently gaining popularity.

Highlights

One of the most important tasks of any data processing system is storing the data it receives
The aim of this paper is to analyze the formats used for storing and processing data in data lakes based on the Apache Hadoop platform, their features, and their applicability to various tasks, such as analytics, streaming, etc.
The described data storage formats can be conventionally divided into groups of alternative formats, depending on the tasks these formats are assigned in big data processing systems
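
As an illustration of how storage formats fall into groups, the sketch below contrasts the two groups named in the outline of this paper, row-oriented and column-oriented, using plain in-memory Python structures. It is a minimal sketch and not the on-disk layout of any specific format: the sample records, the field names, and the helper functions `to_row_layout` and `to_column_layout` are all hypothetical.

```python
# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "city": "Moscow", "amount": 10.0},
    {"id": 2, "city": "Kazan", "amount": 25.5},
    {"id": 3, "city": "Moscow", "amount": 4.5},
]


def to_row_layout(rows):
    """Row-oriented: all fields of one record are kept together."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Column-oriented: all values of one field are kept together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# An analytical query that needs a single field.
# With rows, every record is visited and the field is picked out of each one.
total_from_rows = sum(row[2] for row in row_layout)
# With columns, only the one relevant list is read.
total_from_columns = sum(column_layout["amount"])

# Appending a record touches one place in the row layout,
# but one list per field in the column layout.
new_record = {"id": 4, "city": "Tula", "amount": 1.0}
row_layout.append(tuple(new_record.values()))
for name, value in new_record.items():
    column_layout[name].append(value)
```

The trade-off this sketch suggests is that a column layout tends to suit analytical queries that read a few fields across many records, while a row layout tends to suit workloads that write or read whole records one at a time, such as streaming.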

Summary

INTRODUCTION

One of the most important tasks of any data processing system is storing the data it receives. New data storage formats are gaining popularity, such as Apache Hudi [19], Apache Iceberg [20], and Delta Lake [21]. Each of these file formats has its own features in file structure. The aim of this paper is to analyze the formats used for storing and processing data in data lakes based on the Apache Hadoop platform, their features, and their applicability to various tasks, such as analytics, streaming, etc. A misunderstanding of the structural features of data storage tools, and of the recommendations for their use, can lead to problems when maintaining data processing systems. The article describes both well-known, widely used formats for storing big data and new formats that are now gaining popularity. The Challenges section explores emerging storage trends for building data lakes.

BIG DATA STORAGE FORMATS

Column-oriented Formats

Row-oriented Formats

CHALLENGES

Apache Iceberg

Delta Lake

CONCLUSION


Journal: International Journal of Advanced Computer Science and Applications	Publication Date: Jan 1, 2021
Citations: 3	License type: cc-by


Similar Papers

Experimental Characteristics Study of Data Storage Formats for Data Marts Development within Data Lakes
Evgeny Nikulchev ... Vladimir Belov
Applied Sciences | VOL. 11 | 17 Sep 2021

The concept of an intelligent data lake management system: machine consciousness and a universal data model
Anna S Zenger ... Alyona K Tsvetkova
Procedia Computer Science | VOL. 213 | 01 Jan 2021

Data Lakes: A Panacea for Big Data Problems, Cyber Safety Issues, and Enterprise Security
Mohiuddin Ahmed ... Abu Barkat Ullah
25 Feb 2022

Cloud DATA LAKE: The new trend of data storage
Elisabeta Zagan ... Mirela Danubianu
11 Jun 2021
