Abstract

Processing data streams in real time is crucial for many applications, but processing large amounts of data from different sources, such as sensor networks, web traffic, social media and video streams, is a major challenge. The main problem is that big data systems are based on Hadoop technology, and in particular on MapReduce for processing. MapReduce is a highly scalable and fault-tolerant framework. It processes large amounts of data in batches and provides insight into older data, but it can only process a limited set of data. MapReduce is therefore not appropriate for real-time stream processing, where it is very important to process data the moment they arrive in order to respond quickly and support good decision making. Hence the need for a new architecture that allows real-time data processing at high speed and with low latency. The main aim of this paper is to give a clear survey of the open-source technologies that exist for real-time data stream processing, including their system architectures. We also propose a new architecture for real-time processing, based mainly on the preceding comparisons and powered by machine learning and Storm technology.

Highlights

  • With the exponential growth of the internet-connected world, a very large amount of data is produced in the form of continuous streams from many sources, such as sensor networks, search engines, e-mail clients, social networks, e-commerce and computer logs

  • In this paper we present an overview of some fundamental notions of big data, stream processing and the increasing volume of data

  • Each stream is represented by a DStream, which supports operations such as transforming a stream to obtain another stream, merging several streams into one, joining streams, joining a stream with a single resilient distributed dataset (RDD), filtering a stream against another stream, updating a state from a stream, and so on
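
The stream operations listed in the last highlight can be illustrated with a small, dependency-free sketch. It is not the Spark Streaming API: it assumes a stream can be modelled as a plain list of micro-batches, and every name in it (`transform`, `merge`, `filter_against`, `join_static`, `running_counts`, and the sample data) is hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumption: a "stream" is modelled as a list of micro-batches,
# and each micro-batch is a list of records. This is NOT Spark's DStream class.


def transform(stream, fn):
    """Apply fn to every record of every batch, giving a new stream."""
    return [[fn(record) for record in batch] for batch in stream]


def merge(left, right):
    """Merge two streams batch by batch (same batch interval assumed)."""
    return [a + b for a, b in zip(left, right)]


def filter_against(stream, blocked):
    """Per batch, drop records that also occur in the matching batch of `blocked`."""
    return [
        [record for record in batch if record not in set(block)]
        for batch, block in zip(stream, blocked)
    ]


def join_static(stream, table):
    """Join each record with a fixed lookup table (a stand-in for a single RDD)."""
    return [[(key, table[key]) for key in batch if key in table] for batch in stream]


def running_counts(stream):
    """Update a state from a stream: cumulative count per key after each batch."""
    state = Counter()
    snapshots = []
    for batch in stream:
        state.update(batch)
        snapshots.append(dict(state))
    return snapshots


# Hypothetical sample data: three micro-batches per stream.
clicks = [["a", "b"], ["a"], ["c", "a"]]
views = [["x"], ["b", "y"], []]

if __name__ == "__main__":
    print(merge(clicks, views))
    print(running_counts(clicks))
```

Each helper returns a new stream rather than mutating its input, which mirrors the idea that an operation on a stream yields another stream; only `running_counts` carries state from one batch to the next.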


Summary

A New Architecture for Real Time Data Stream Processing

The main problem is that big data systems are based on Hadoop technology, and in particular on MapReduce for processing. MapReduce is a highly scalable and fault-tolerant framework. It processes large amounts of data in batches and provides insight into older data, but it can only process a limited set of data. The main aim of this paper is to give a clear survey of the open-source technologies that exist for real-time data stream processing, including their system architectures. We also propose a new architecture for real-time processing, based mainly on the preceding comparisons and powered by machine learning and Storm technology.

INTRODUCTION
BIG DATA
Apache Spark
A COMPARISON OF DATA PROCESSING TECHNOLOGIES
Lambda Architecture
Kappa Architecture
PROPOSED ARCHITECTURE
COMPARISON WITH RELATED ARCHITECTURES
CONCLUSION
