Abstract
The ever-increasing rate of data generation confronts us with the problem of handling massive amounts of information online. One of the biggest challenges is how to extract valuable information from these massive, continuous data streams in a single scan. In a data stream context, data arrive continuously at high speed; therefore, the algorithms developed for this setting must be efficient in their use of memory and time and capable of detecting changes over time in the underlying distribution that generates the data. This work describes a novel method for the task of pattern classification over a continuous data stream based on an associative model. The proposed method is based on the Gamma classifier, a supervised pattern recognition model inspired by the Alpha-Beta associative memories. It is capable of handling the space and time constraints inherent to data stream scenarios. The Data Streaming Gamma classifier (DS-Gamma classifier) implements a sliding-window approach to provide concept drift detection and a forgetting mechanism. To test the classifier, several experiments were performed on different data stream scenarios with real and synthetic data streams. The experimental results show that the method exhibits competitive performance when compared to other state-of-the-art algorithms.
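As a point of reference, the minimal sketch below shows one way a sliding-window stream classifier with a forgetting mechanism and a simple drift signal can be organized. It is not the DS-Gamma classifier itself: the Gamma similarity operator is replaced by a plain 1-NN vote over the window purely for illustration, and the class name, default window size, and drift threshold are assumptions made for this sketch.

```python
from collections import deque

class SlidingWindowStreamClassifier:
    """Illustrative sliding-window stream classifier (NOT the DS-Gamma classifier).

    Forgetting: the window is a bounded deque, so the oldest examples are
    dropped automatically. Drift signal: the error rate over the window is
    compared against a fixed threshold.
    """

    def __init__(self, window_size=500, drift_threshold=0.3):
        self.window = deque(maxlen=window_size)   # (x, y) pairs; oldest fall out
        self.errors = deque(maxlen=window_size)   # recent 0/1 prediction errors
        self.drift_threshold = drift_threshold

    def predict(self, x):
        if not self.window:
            return None
        # 1-NN over the current window (placeholder for the Gamma operator)
        nearest = min(
            self.window,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
        )
        return nearest[1]

    def partial_fit(self, x, y):
        # Prequential (test-then-train) bookkeeping: predict first, then learn.
        y_hat = self.predict(x)
        if y_hat is not None:
            self.errors.append(int(y_hat != y))
        self.window.append((x, y))

    def drift_detected(self):
        # Crude drift signal: windowed error rate exceeds the threshold.
        if len(self.errors) < self.errors.maxlen:
            return False
        return sum(self.errors) / len(self.errors) > self.drift_threshold
```

The bounded deque is what implements the forgetting mechanism; any incremental drift detector could replace the crude error-rate threshold used here.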
Highlights
In recent years, technological advances have promoted the generation of vast amounts of information from different areas of knowledge: sensor networks, financial data, fraud detection, and web data, among others.
In this paper, we describe a novel method for the task of pattern classification over a continuous data stream based on an associative model.
The performance of the DS-Gamma classifier improves as the window size increases, with the exception of the Electricity data stream, for which performance is better with smaller window sizes.
Summary
Technological advances have promoted the generation of a vast amount of information from different areas of knowledge: sensor networks, financial data, fraud detection, and web data, among others. According to a study by IDC (International Data Corporation) [1], the digital universe in 2013 was estimated at 4.4 trillion gigabytes. Of this digital data, only 22% would be a candidate for analysis, while the available storage capacity could hold just 33% of the generated information. Unlike traditional algorithms, those developed for this context must meet the constraints defined in [2]: work with a limited amount of time, use a limited amount of memory, and make one or only a few passes over the data. They should also be capable of reacting to concept drift, that is, changes in the distribution of the data over time. More details on the requirements for data stream algorithms can be found in [3].
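To make these constraints concrete, the toy loop below processes a synthetic stream in a single pass with constant memory, interleaving prediction and update in the prequential (test-then-train) fashion. The stream generator, the drift point at example 5,000, and the majority-class baseline are all illustrative assumptions and are not taken from the paper.

```python
import random

def stream(n=10_000, drift_at=5_000, seed=0):
    """Synthetic binary stream whose class balance shifts halfway (abrupt drift)."""
    rng = random.Random(seed)
    for i in range(n):
        p = 0.8 if i < drift_at else 0.2      # P(class = 1) changes at drift_at
        yield (rng.random(),), int(rng.random() < p)

# Majority-class baseline updated online: O(1) memory, a single pass over the data.
counts = {0: 0, 1: 0}
errors = 0
seen = 0
for x, y in stream():
    y_hat = max(counts, key=counts.get) if seen else 0   # test ...
    errors += int(y_hat != y)
    counts[y] += 1                                        # ... then train
    seen += 1

print(f"prequential error rate: {errors / seen:.3f}")
```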