Abstract

In the age of the Internet of Things and social media platforms, huge amounts of digital data are generated by and collected from many sources, including sensors, mobile devices, wearable trackers and security cameras. This data, commonly referred to as Big Data, is challenging current storage, processing, and analysis capabilities. New models, languages, systems and algorithms continue to be developed to effectively collect, store, analyze and learn from Big Data. Most recent surveys provide a global analysis of the tools used in the main phases of Big Data management (generation, acquisition, storage, querying and visualization of data). In contrast, this work analyzes and reviews the parallel and distributed paradigms, languages and systems used today to analyze and learn from Big Data on scalable computers. In particular, we provide an in-depth analysis of the properties of the main parallel programming paradigms (MapReduce, workflow, BSP, message passing, and SQL-like) and, through programming examples, we describe the most widely used systems for Big Data analysis (e.g., Hadoop, Spark, and Storm). Furthermore, we discuss and compare the different systems by highlighting their main features, their diffusion (community of developers and users) and the main advantages and disadvantages of using them to implement Big Data analysis applications. The final goal of this work is to help designers and developers identify and select the most appropriate programming solution based on their skills, hardware availability, application domains and purposes, also considering the support provided by the developer community.

Highlights

  • In recent years, with the development of the Internet of Things, the growth of social networks and the widespread diffusion of mobile devices, enormous amounts of digital data have been generated by and gathered from several sources.

  • To extract valuable information from the analysis of such data, novel architectures, programming models and systems have been developed in recent years to address its complexity and/or high velocity [6, 7].

  • Sequential data analysis algorithms cannot extract useful models and patterns from huge volumes of data in a reasonable time. Data scientists need high performance computers, such as many-core and multi-core systems, Clouds, and multi-clusters, along with parallel and distributed algorithms and systems, to tackle Big Data issues [9].


Summary

Introduction

With the development of the Internet of Things, the growth of social networks and the widespread diffusion of mobile devices, enormous amounts of digital data are being generated by and gathered from several sources.

Apache Pig is another Hadoop-based framework that exploits a SQL-like language for executing data flow applications on large-scale infrastructures. It was originally developed to ease the development of Big Data analysis applications, allowing programmers to write a data analysis application as a procedural, script-based data flow.

Thanks to the introduction of YARN (Yet Another Resource Negotiator) in 2013, Hadoop turned from a batch processing solution into a reference platform for several other programming systems, such as: Storm for streaming data analysis; Hama for graph analysis; Hive for querying large datasets; HBase for random and real-time read/write access to data in a non-relational model; Oozie for managing Hadoop jobs; Ambari for provisioning, managing, and monitoring Hadoop clusters; ZooKeeper for maintaining configuration information, naming, and providing distributed synchronization and group services; and more.

The current version of MPI, MPI-4, provides extensions to better support hybrid programming models and fault tolerance.
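
The procedural data flow style attributed to Apache Pig above can be illustrated with a minimal sketch in plain Python. This is not Pig Latin and does not use any Pig or Hadoop API; it only mimics the load, filter, group and aggregate steps that such a script typically chains together. The record layout, the function names and the `min_seconds` threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (user, page, seconds_on_page).
records = [
    ("alice", "home", 12),
    ("bob", "home", 5),
    ("alice", "cart", 40),
    ("carol", "home", 2),
    ("bob", "cart", 31),
]

def load(rows):
    """Step 1: give each raw tuple named fields."""
    return [{"user": u, "page": p, "seconds": s} for u, p, s in rows]

def filter_rows(rows, min_seconds):
    """Step 2: keep only visits lasting at least min_seconds."""
    return [r for r in rows if r["seconds"] >= min_seconds]

def group_by(rows, key):
    """Step 3: collect rows that share the same value of `key`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return dict(groups)

def total_seconds(groups):
    """Step 4: aggregate each group into a single total."""
    return {k: sum(r["seconds"] for r in rs) for k, rs in groups.items()}

def run_pipeline(rows, min_seconds=5):
    """Chain the four steps, one explicit stage after another."""
    loaded = load(rows)
    kept = filter_rows(loaded, min_seconds)
    grouped = group_by(kept, "page")
    return total_seconds(grouped)

if __name__ == "__main__":
    print(run_pipeline(records))
```

Each stage consumes the output of the previous one, which is the property that lets a data flow system distribute the stages over a cluster; here everything runs sequentially in a single process.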
