Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark

Vladimir Belov,Evgeny Nikulchev,Andrey Tatarintsev

doi:10.3390/sym13020195


Open Access

https://doi.org/10.3390/sym13020195


Abstract

One of the most important tasks of any big data processing platform is storing the data it receives. Different systems place different requirements on big data storage formats, which raises the problem of choosing the optimal storage format for the task at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them. The formats considered are Avro, CSV, JSON, ORC, and Parquet. In the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; in the second stage, an experimental evaluation of these formats was prepared and conducted. For the experiment, an experimental stand was deployed with big data processing tools installed on it. The aim of the experiment was to determine characteristics of the data storage formats, such as the stored volume and the processing speed for different operations, using the Apache Spark framework. In addition, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
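The kind of measurement the abstract describes (stored volume and time per operation, per format) can be illustrated with a minimal sketch. It is not the authors' experiment: it uses only the Python standard library instead of Apache Spark, covers only CSV and JSON (Avro, ORC, and Parquet need third-party libraries), and the record schema and the helper names `make_records`, `to_csv`, `to_json_lines`, and `measure` are hypothetical.

```python
import csv
import io
import json
import time


def make_records(n):
    """Build n synthetic rows; the schema is hypothetical, not the paper's dataset."""
    return [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(n)]


def to_csv(records):
    """Serialize the rows as CSV with a header line and return the bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue().encode("utf-8")


def to_json_lines(records):
    """Serialize the rows as one JSON object per line and return the bytes."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")


def measure(serializer, records):
    """Return the serialized size in bytes and the wall-clock time of one write."""
    start = time.perf_counter()
    payload = serializer(records)
    elapsed = time.perf_counter() - start
    return {"bytes": len(payload), "seconds": elapsed}


records = make_records(1000)
results = {
    "csv": measure(to_csv, records),
    "json": measure(to_json_lines, records),
}
for name, stats in results.items():
    print(f"{name}: {stats['bytes']} bytes, {stats['seconds']:.6f} s")
```

A single timing like this is noisy; a real evaluation would repeat each operation several times and would also time reads, filters, and aggregations on the stored data.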

Highlights

The development of technologies that work with data has contributed to the emergence of various tools for big data processing [1].
This paper describes the five most popular formats for storing big data in the Apache Hadoop system, and presents an experimental evaluation of these formats together with tropical optimization methods for choosing an effective solution.
The study is an example of applying experimental evaluation and tropical optimization methods to find the optimal data storage format when developing a data processing and storage system on the Apache Hadoop platform with the Apache Spark framework.
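The highlights mention tropical optimization without spelling out the algorithm. As a hedged illustration only, the sketch below shows one formulation found in the tropical optimization literature: deriving a rating vector from a pairwise comparison matrix in max-times algebra (addition is `max`, multiplication is ordinary multiplication). Whether this matches the authors' exact procedure is an assumption; the function names and the 3x3 example matrix are hypothetical and are not data from the paper.

```python
def maxtimes_matmul(a, b):
    """Matrix product in max-times algebra: entry (i, j) is max over k of a[i][k] * b[k][j]."""
    n = len(a)
    return [[max(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_radius(a):
    """Tropical spectral radius: max over m = 1..n of (tropical trace of A^m)^(1/m)."""
    n = len(a)
    power = a
    best = 0.0
    for m in range(1, n + 1):
        trace = max(power[i][i] for i in range(n))
        best = max(best, trace ** (1.0 / m))
        power = maxtimes_matmul(power, a)
    return best


def kleene_star(a):
    """Kleene star I + A + ... + A^(n-1), where '+' is the entrywise max."""
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    star, power = identity, identity
    for _ in range(n - 1):
        power = maxtimes_matmul(power, a)
        star = [[max(star[i][j], power[i][j]) for j in range(n)]
                for i in range(n)]
    return star


def rating_vector(a):
    """Take the first column of the Kleene star of A scaled by its spectral
    radius as one representative rating vector, normalized so the top rating is 1."""
    lam = spectral_radius(a)
    scaled = [[v / lam for v in row] for row in a]
    column = [row[0] for row in kleene_star(scaled)]
    top = max(column)
    return [v / top for v in column]


# Hypothetical, perfectly consistent comparisons of three alternatives:
# entry (i, j) says how many times alternative i is preferred over alternative j.
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
ratings = rating_vector(comparisons)
print(ratings)
```

For an inconsistent comparison matrix the solution is generally not unique, so picking a single column of the Kleene star is a simplification made for this sketch.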

Summary

Introduction

The development of technologies that work with data has contributed to the emergence of various tools for big data processing [1]. Big data refers to volumes of information, collected from various sources, that are very difficult or impossible to process using traditional methods [2,3]. The development of platforms for analytical data processing has become a popular direction in the field of big data [5]. Such platforms are designed both for processing and for storing data. The authors do not give recommendations on the choice of an integration messaging system.

Objectives

Methods

Results

Discussion

Conclusion