Abstract

Developing Big Data applications has become very important in the last few years. Many organizations and industries are aware that data analysis is becoming an important factor in staying competitive and discovering new trends and insights. The data ingestion and preparation step is the starting point for developing any Big Data project. This paper reviews some of the most widely used Big Data ingestion and preparation tools and discusses the main features, advantages and usage of each tool. Its purpose is to help users select the right ingestion and preparation tool according to their needs and their applications’ requirements.

Highlights

  • In recent years data has been growing quickly: multiple sources such as computers, social media and mobile phones generate large volumes of data in different formats, namely structured, semi-structured and unstructured

  • Data ingestion is an important step in building any big data project; it is frequently discussed together with the ETL concept, which stands for extract, transform, and load

  • The number of smart and IoT devices is increasing rapidly, so the volume and the variety of formats of the generated data are increasing too. This is considered the biggest challenge of big data ingestion, because the business needs to read the large volume of generated data at an acceptable speed

Summary

Introduction

In recent years data has been growing quickly: multiple sources such as computers, social media and mobile phones generate large volumes of data in different formats, namely structured, semi-structured and unstructured. Data ingestion is an important step in building any big data project; it is frequently discussed together with the ETL concept, which stands for extract, transform, and load. This paper discusses the Big Data ingestion process and different tools for batch and stream ingestion, such as Sqoop, NiFi, Flume and Kafka. Each tool is discussed in terms of its features, architecture and a real use case. The paper also compares the big data ingestion tools against different criteria; this comparison helps users choose the tool that satisfies their needs. Section three presents the big data ingestion concept, parameters and challenges, reviews some ingestion tools categorized by ingestion type, either batch or stream, and discusses each tool in detail. Section four introduces the data preparation process, a pre-processing step for data quality enhancement, and describes some data preparation tools with their main characteristics and real use cases.

Data Source
Data Ingestion Parameters
Data Ingestion Challenges
Batch Data Ingestion
Sqoop Apache
NIFI Apache
Stream Data Ingestion
Flume Apache
Data Preparation
Hive Apache
Impala Apache
Storm Apache
Conclusion and Future Work