Abstract

ETL (Extract-Transform-Load) tools, traditionally developed to operate offline on historical data for feeding data warehouses, need to be enhanced to deal with continuously growing streaming data and to execute at the network level during data stream acquisition. This paper presents NMStream, a scalable, web-based ETL system. NMStream is based on an event-driven architecture and is designed to integrate distributed and heterogeneous streaming data by combining Apache Flume and the Cassandra database system; its ETL processes are carried out through the Flume agent object. NMStream can be used to feed traditional or real-time data warehouses or data analytics tools in a stable and effective manner.

Highlights

  • Advances in sensing technologies have dramatically improved the accuracy and spatiotemporal scope of recorded data

  • We proposed NMStream, a highly scalable web-based ETL framework for collecting heterogeneous streaming data and performing ETL operations during the data stream acquisition stage on a programmable network

  • We briefly introduced the Apache Flume and Cassandra systems, then described the overall architecture of NMStream and its core components


Summary

INTRODUCTION

Advances in sensing technologies have dramatically improved the accuracy and spatiotemporal scope of recorded data. According to Yang et al. (2011), sensors across the globe capture massive amounts of multi-dimensional data recording various physical phenomena, and the volume of collected sensing data grows by terabytes to petabytes per day. To handle big real-time data streams properly, ETL operations should be specified and executed efficiently, online, on fresh and timely data. All of these technical requirements should be addressed in graphical, user-friendly environments that support the user in designing and executing the operations.
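As a rough illustration of what performing ETL online, during acquisition, can look like, the following minimal Python sketch transforms each event as it arrives instead of batching historical data. It is not NMStream's implementation and does not use the Apache Flume or Cassandra APIs: the `Agent` class, the `to_celsius` transform, the field names, and the in-memory list standing in for a data store are all hypothetical, chosen only to show the source, channel, sink flow of an event-driven agent.

```python
from collections import deque
from typing import Callable, Optional

Event = dict  # one streaming record, e.g. a single sensor reading (assumed shape)


class Agent:
    """Toy stand-in for an event-driven agent: source -> channel -> sink."""

    def __init__(self,
                 transform: Callable[[Event], Optional[Event]],
                 sink: Callable[[Event], None]) -> None:
        self.channel = deque()      # buffers transformed events awaiting delivery
        self.transform = transform  # the "T" of ETL, applied per event
        self.sink = sink            # the "L" of ETL, e.g. a database writer

    def on_event(self, event: Event) -> None:
        """Called once per incoming event (the 'E' step happens upstream)."""
        cleaned = self.transform(event)
        if cleaned is not None:     # None means the event was rejected
            self.channel.append(cleaned)

    def drain(self) -> int:
        """Deliver all buffered events to the sink; return how many were sent."""
        delivered = 0
        while self.channel:
            self.sink(self.channel.popleft())
            delivered += 1
        return delivered


def to_celsius(event: Event) -> Optional[Event]:
    """Hypothetical transform: drop malformed readings, convert F to C."""
    if "temp_f" not in event:
        return None
    celsius = (event["temp_f"] - 32) * 5 / 9
    return {"sensor": event["sensor"], "temp_c": round(celsius, 2)}


store = []  # in-memory list standing in for the target data store
agent = Agent(to_celsius, store.append)

for raw in [{"sensor": "a", "temp_f": 212.0},
            {"sensor": "b"},                  # malformed: no reading
            {"sensor": "c", "temp_f": 32.0}]:
    agent.on_event(raw)

delivered = agent.drain()
```

The point of the sketch is that the transform runs per event at arrival time, so the store only ever receives cleaned, current records; the channel decouples the arrival rate from the delivery rate.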

RELATED WORK
SYSTEM INTRODUCTION
Apache Flume
Apache Cassandra
NMStream architecture
NMSTREAM STREAMING DATA MODEL
CONCLUSION