Abstract

Big data streams have recently become ubiquitous because many applications generate huge amounts of data at great velocity. The inherent dynamic characteristics of big data make it difficult to apply existing data mining tools, technologies, methods, and techniques directly to big data streams. This paper presents a systematic review of big data stream analysis that employs a rigorous and methodical approach to examine trends in big data stream tools and technologies, as well as the methods and techniques used to analyse big data streams. It provides a global view of big data stream tools and technologies and compares them. Three major databases, Scopus, ScienceDirect and EBSCO, which index journals and conferences promoted by entities such as IEEE, ACM, SpringerLink, and Elsevier, were explored as data sources. Of the initial 2295 papers returned by the first search string, 47 were found to be relevant to our research questions after the inclusion and exclusion criteria were applied. The study found that scalability, privacy and load balancing issues, as well as empirical analysis of big data streams and technologies, are still open for further research. We also found that, although significant research effort has been directed at real-time analysis of big data streams, not much attention has been given to the preprocessing stage. Only a few big data streaming tools and technologies can handle batch, streaming, and iterative jobs together; no single tool or technology appears to offer all the key features currently required, and a standard benchmark dataset for big data streaming analytics has not been widely adopted.
In conclusion, we recommend that research efforts be geared towards developing scalable frameworks and algorithms that accommodate the data stream computing mode, effective resource allocation strategies and parallelization issues, in order to cope with the ever-growing size and complexity of data.

Highlights

  • Advances in information technology have facilitated large-volume, high-velocity data and the ability to store data continuously, leading to several computational challenges. Because of the volume, velocity, variety, variability, veracity, volatility, and value [1] of the data now being generated, big data computing is a new trend for future computing. Big data computing can generally be categorized into two types based on processing requirements: big data batch computing and big data stream computing [2] (Kolajo et al. J Big Data (2019) 6:47).

  • Conclusion and further work: As a result of the challenges and opportunities presented by the information technology revolution, big data streaming analytics has emerged as the new frontier of competition and innovation.

  • Organisations that seize the opportunity of big data streaming analytics gain insights for robust real-time decision making, giving them an edge over their competitors.


Summary

Introduction

Advances in information technology have facilitated large-volume, high-velocity data and the ability to store data continuously, leading to several computational challenges. Because of the volume, velocity, variety, variability, veracity, volatility, and value [1] of the data now being generated, big data computing is a new trend for future computing. Big data computing can generally be categorized into two types based on processing requirements: big data batch computing and big data stream computing [2]. Big data batch processing is not sufficient for analysing real-time application scenarios. The output must be generated with low latency, and any incoming data must be reflected in the newly generated output within seconds. This necessitates big data stream analysis [3].

Stream computing

Stream computing refers to the processing of massive amounts of data generated at high velocity from multiple sources, with low latency and in real time. It is a new paradigm necessitated by new data-generating scenarios, which include the ubiquity of location services, mobile devices, and sensor pervasiveness [4].
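The contrast between batch and stream computing described above can be illustrated with a minimal sketch. It is not taken from the reviewed paper; the function names and the sample readings are hypothetical, and a simple running mean stands in for a real analytics job. The batch function needs the whole dataset before it can answer, while the streaming function updates its result as each element arrives and keeps only constant state.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the full dataset must be available before any output."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated mean after every arriving element."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: only the count and current mean are stored,
        # so memory use does not grow with the length of the stream.
        mean += (x - mean) / count
        yield mean


# Hypothetical sensor readings, used only for illustration.
readings = [4.0, 8.0, 6.0, 2.0]
running = list(streaming_mean(readings))
final_batch = batch_mean(readings)
```

Once the stream has been fully consumed, the last value in `running` matches `final_batch`, but every intermediate value was available as soon as its element arrived, which is the low-latency property that batch processing lacks.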
