Abstract

When developing large data processing systems, the question of data storage arises. One modern tool for solving this problem is the data lake. Many data lake implementations use Apache Hadoop as their base platform. Hadoop has no default data storage format, so a format must be chosen when designing a data processing system. This choice should rest on an assessment against several criteria. Experimental evaluation alone, however, does not always give a complete understanding of what a particular data storage format can do. It is then necessary to study the features of the format, its internal structure, the recommendations for its use, and so on. The article describes the features of both widely used data storage formats and those currently gaining popularity.

Highlights

  • One of the most important tasks of any data processing system is storing the data it receives

  • The aim of this paper is to analyze the formats used for storing and processing data in data lakes based on the Apache Hadoop platform, their features, and their suitability for various tasks, such as analytics and streaming

  • The data storage formats described can be conventionally divided into groups of alternative formats, depending on the tasks these formats serve in big data processing systems


Summary

INTRODUCTION

One of the most important tasks of any data processing system is storing the data it receives. New data storage formats are gaining popularity, such as Apache Hudi [19], Apache Iceberg [20], and Delta Lake [21]. Each of these file formats has its own features in file structure. The aim of this paper is to analyze the formats used for storing and processing data in data lakes based on the Apache Hadoop platform, their features, and their suitability for various tasks, such as analytics and streaming. Misunderstanding the structural features of data storage tools, and the recommendations for their use, can lead to problems at the maintenance stage of data processing systems. The article describes both well-known, widely used formats for storing big data and new formats that are now gaining popularity. The Challenges section explores emerging storage trends for building data lakes.

BIG DATA STORAGE FORMATS
Column-oriented Formats
Row-oriented Formats
CHALLENGES
Apache Iceberg
Delta Lake
CONCLUSION
