Abstract

Today, big data are generated from many sources, and there is huge demand for storing, managing, processing, and querying them. The MapReduce model and its open source implementation, Hadoop, have proven themselves the de facto solution for big data processing, and are inherently designed for batch, high-throughput jobs. Although Hadoop is well suited to batch jobs, there is increasing demand for non-batch workloads such as interactive jobs, real-time queries, and big data streams. Since Hadoop is not suitable for these non-batch workloads, new solutions have been proposed to address these challenges. In this article, we discuss two categories of these solutions: real-time processing and stream processing of big data. For each category, we discuss paradigms, strengths, and differences from Hadoop. We also introduce some practical systems and frameworks for each category. Finally, some simple experiments are performed to demonstrate the effectiveness of the new solutions compared with available Hadoop-based solutions.

Highlights

  • The "Big Data" paradigm has experienced expanding popularity recently

  • Solutions in this sector can be classified into two major categories: (i) solutions that try to reduce the overhead of MapReduce and make it faster, so that jobs can execute within seconds or less; (ii) solutions that focus on providing a means for real-time queries over structured and unstructured big data using new, optimized approaches

  • For real-time queries over big data, a comprehensive benchmark has been carried out by the Berkeley AMP Lab [29]

Summary

Introduction

The "Big Data" paradigm has experienced expanding popularity recently. The term "Big Data" is generally used for datasets that are so large that they cannot be processed and managed using classical solutions such as relational database management systems (RDBMS). The most notable solution proposed for managing and processing big data is the MapReduce framework, which was initially introduced and used by Google [4]. MapReduce is designed for batch processing of large volumes of data, and it is not suitable for recent demands such as real-time and online processing. We give a brief survey focused on two new aspects: real-time processing and stream processing solutions for big data. Stream processing has numerous use cases, such as online machine learning and continuous computation. These new trends need systems that are more elaborate and agile than currently available MapReduce solutions such as the Hadoop framework.

The MapReduce Framework
Apache Hadoop
MapReduce Extensions
Other Models
Real-Time Big Data Processing
In-Memory Computing
Real-Time Queries over Big Data
Streaming Big Data
Experimental Results
Conclusions
