Abstract

The ‘Big Data’ of yesterday is the ‘data’ of today. As technology progresses, new challenges arise and new solutions are developed. Due to the emergence of Internet of Things applications within the last decade, the field of Data Mining has been faced with the challenge of processing and analysing data streams in real time, and under high data throughput conditions. This is often referred to as the Velocity aspect of Big Data. Whereas there are numerous reviews on Data Stream Mining techniques and applications, there is very little work surveying Data Stream processing paradigms and associated technologies, from data collection through to pre-processing and feature processing, from the perspective of the user, not that of the service provider. In this article, we evaluate a particular type of solution, which focuses on streaming data, and processing pipelines that permit online analysis of data streams that cannot be stored as-is on the computing platform. We review foundational computational concepts such as distributed computation, fault-tolerant computing, and computational paradigms/architectures. We then review the available technological solutions, and applications that pertain to data stream mining as case studies of these theoretical concepts. We conclude with a discussion of the field of data stream processing/analytics, future directions and research challenges.

Highlights

  • Stemming from recent technological advancement, what came to be coined the ‘Data Era’ [1]–[3] is concurrent with a dramatic increase in the portability of computerized devices

  • So far, we have introduced the concepts of distributed computing, and Big Data appliances as a pool of computational power

  • The main conclusion that can be drawn from this review is that ‘Big Data’ and real-time analytics form a highly complex and interdisciplinary field that requires diverse expertise in network communication, IT infrastructure, storage, control and optimisation; this expertise is required even before one can begin to plan the types of processing and analytical pipelines that could yield return on investment from the data available

Summary

INTRODUCTION

Stemming from recent technological advancement, what came to be coined the ‘Data Era’ [1]–[3] is concurrent with a dramatic increase in the portability of computerized devices. Many solutions offer only slight variations on each other within the same processing paradigm, and are most often based on outdated scientific publications with limited relevance to the state of the art. As they are initially created to answer a specific problem, these solutions have their own innovations and operative modes (Apache Kafka/Samza by LinkedIn, OpenStack by Rackspace Hosting and NASA, Apache FlumeJava/Millwheel/Beam by Google, etc.), but as they are developed further, they may be extended to operate outside their original specifications, regardless of whether they can excel and respond to the specific needs of their creators in their modified state.

COMPUTATIONAL CONCEPTS
FULL STACK AND CLOUD COMPUTING
Findings
CONCLUDING REMARKS
