Abstract
Processing data streams in real time is crucial for many applications, but handling large volumes of data from diverse sources, such as sensor networks, web traffic, social media, and video streams, is a major challenge. The main problem is that big data systems are typically based on Hadoop technology, and in particular on MapReduce for processing. MapReduce is a highly scalable and fault-tolerant framework. It processes large amounts of data in batches and provides insight into older data, but it can only process a limited set of data. MapReduce is therefore not appropriate for real-time stream processing, where data must be processed the moment they arrive to achieve fast responses and good decision making. Hence the need for a new architecture that supports real-time data processing at high speed and with low latency. The main aim of this paper is to give a clear survey of the open-source technologies available for real-time data stream processing, including their system architectures. We also propose a new architecture, based mainly on the preceding comparison of real-time processing technologies, powered by machine learning and Storm.
Highlights
With the exponential growth in the number of devices and services connected to the internet, a very large amount of data is produced in the form of continuous streams from many sources, such as sensor networks, search engines, e-mail clients, social networks, e-commerce, and computer logs.
In this paper we present an overview of some fundamental notions of big data, stream processing, and the increasing volume of data.
Each stream is represented by a DStream, which supports operations such as transforming a stream to obtain another stream, merging several streams into one, joining streams, joining a stream with a single resilient distributed dataset (RDD), filtering one stream against another, updating state from a stream, and so on.
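The stream operations listed above can be illustrated with a minimal sketch. It does not use Spark: a stream is modeled here as a plain list of micro-batches, each batch being a list of (key, value) pairs, and a dict stands in for a single RDD. All function names are hypothetical and chosen only for illustration; they are not the DStream API.

```python
from collections import defaultdict

# Assumption: a "stream" is a list of micro-batches, and each batch is a
# list of (key, value) pairs. This is a toy model, not Spark's DStream.

def transform(stream, fn):
    """Apply fn to every record, producing a new stream."""
    return [[fn(rec) for rec in batch] for batch in stream]

def merge(*streams):
    """Merge several streams into one, batch by batch."""
    return [[rec for batch in batches for rec in batch]
            for batches in zip(*streams)]

def join(left, right):
    """Join two streams on key within each pair of aligned batches."""
    joined = []
    for lbatch, rbatch in zip(left, right):
        index = defaultdict(list)
        for key, value in rbatch:
            index[key].append(value)
        joined.append([(k, (lv, rv)) for k, lv in lbatch
                       for rv in index.get(k, [])])
    return joined

def join_with_static(stream, table):
    """Join every batch with one static dataset (a dict stands in for an RDD)."""
    return [[(k, (v, table[k])) for k, v in batch if k in table]
            for batch in stream]

def filter_against(stream, other):
    """Drop records whose key appears in the aligned batch of another stream."""
    result = []
    for batch, obatch in zip(stream, other):
        blocked = {k for k, _ in obatch}
        result.append([(k, v) for k, v in batch if k not in blocked])
    return result

def update_state(stream, update):
    """Fold each batch into a running per-key state; return a snapshot per batch."""
    state, snapshots = {}, []
    for batch in stream:
        for key, value in batch:
            state[key] = update(state.get(key), value)
        snapshots.append(dict(state))
    return snapshots

# Two small example streams with two micro-batches each.
clicks = [[("a", 1), ("b", 1)], [("a", 1)]]
views = [[("a", 5)], [("b", 2)]]

merged = merge(clicks, views)
totals = update_state(clicks, lambda old, new: (old or 0) + new)
```

In this sketch, `totals` holds the running click count per key after each micro-batch, which mirrors the idea of updating state from a stream. A real Spark Streaming job would express the same steps with DStream transformations rather than list operations.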
More From: International Journal of Advanced Computer Science and Applications