Programming big data analysis: principles and solutions

Loris Belcastro, Riccardo Cantini, Fabrizio Marozzo, Alessio Orsino, Paolo Trunfio, Domenico Talia

doi:10.1186/s40537-021-00555-2

Abstract

In the age of the Internet of Things and social media platforms, huge amounts of digital data are generated by and collected from many sources, including sensors, mobile devices, wearable trackers and security cameras. This data, commonly referred to as Big Data, is challenging current storage, processing, and analysis capabilities. New models, languages, systems and algorithms continue to be developed to effectively collect, store, analyze and learn from Big Data. Most recent surveys provide a global analysis of the tools used in the main phases of Big Data management (generation, acquisition, storage, querying and visualization of data). In contrast, this work analyzes and reviews the parallel and distributed paradigms, languages and systems used today to analyze and learn from Big Data on scalable computers. In particular, we provide an in-depth analysis of the properties of the main parallel programming paradigms (MapReduce, workflow, BSP, message passing, and SQL-like) and, through programming examples, we describe the most used systems for Big Data analysis (e.g., Hadoop, Spark, and Storm). Furthermore, we discuss and compare the different systems by highlighting the main features of each, their diffusion (community of developers and users) and the main advantages and disadvantages of using them to implement Big Data analysis applications. The final goal of this work is to help designers and developers identify and select the most appropriate programming solution based on their skills, hardware availability, application domains and purposes, also considering the support provided by the developer community.

Highlights

In recent years, with the development of the Internet of Things, the growth of social networks and the widespread diffusion of mobile devices, enormous amounts of digital data have been generated by and gathered from many sources.
To extract valuable information from such data, novel architectures, programming models and systems have been developed in recent years to address its complexity and/or high velocity [6, 7].
Sequential data analysis algorithms cannot extract useful models and patterns from huge volumes of data in a reasonable time. Data scientists need high performance computers, such as many-core and multi-core systems, Clouds, and multi-clusters, along with parallel and distributed algorithms and systems, to tackle Big Data issues [9].

Summary

Introduction

With the development of the Internet of Things, the growth of social networks and the widespread diffusion of mobile devices, enormous amounts of digital data are being generated by and gathered from several sources.

Apache Pig is another Hadoop-based framework that exploits a SQL-like language for executing data flow applications on large-scale infrastructures. It was originally developed to ease the development of Big Data analysis applications, allowing programmers to write a data analysis application in a scripting, procedural data flow language.

Thanks to the introduction of YARN (Yet Another Resource Negotiator) in 2013, Hadoop turned from a batch processing solution into a reference platform for several other programming systems, such as: Storm for streaming data analysis; Hama for graph analysis; Hive for querying large datasets; HBase for random and real-time read/write access to data in a non-relational model; Oozie for managing Hadoop jobs; Ambari for provisioning, managing, and monitoring Hadoop clusters; ZooKeeper for maintaining configuration information, naming, and providing distributed synchronization and group services; and more.

The current version of the MPI standard, MPI-4, provides extensions to better support hybrid programming models and fault tolerance.
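
As a rough illustration of the MapReduce paradigm on which Hadoop is based, the following single-process Python sketch counts words in three phases: map, shuffle, and reduce. It is not taken from the paper and does not use the Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for this example, and a real system would run the map and reduce steps in parallel across many nodes.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group emitted values by key (done by the framework in a real system)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases sequentially on an in-memory collection of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data analysis", "big data systems"]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own line and each call to reduce_phase only on its own key, both steps can be distributed without shared state, which is what makes the model suitable for large clusters.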



Journal: Journal of Big Data
Publication Date: Jan 6, 2022
Citations: 34
License type: open-access
