Abstract

Mining data streams is a core element of Big Data Analytics. It addresses the velocity of large datasets, one of the four aspects of Big Data, the other three being volume, variety and veracity. As data streams in, models are constructed using data mining techniques tailored towards continuous and fast model update. The Hoeffding Inequality has been among the most successful approaches in learning theory for data streams. In this context, it is typically used to provide a statistical bound on the number of examples needed in each step of an incremental learning process, and it has been applied to both classification and clustering problems. Despite the success of the Hoeffding Tree classifier and other data stream mining methods, such models fall short of explaining how their results (i.e., classifications) are reached (black boxing). The expressiveness of decision models in data streams is an area of research that has attracted less attention, despite its paramount practical importance. In this paper, we address this issue, adopting the Hoeffding Inequality as an upper bound to build decision rules which can help decision makers with informed predictions (white boxing). We term our novel method Hoeffding Rules, reflecting its use of the Hoeffding Inequality to estimate whether a rule induced from a smaller sample would be of the same quality as a rule induced from a larger sample. The new method brings a number of novel contributions, including handling uncertainty through abstaining, dealing with continuous data through Gaussian statistical modelling, and an algorithm shown experimentally to be fast. We conducted a thorough experimental study on benchmark datasets, showing the efficiency and expressiveness of the proposed technique compared with the state of the art.
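Although the paper's exact formulation is not reproduced on this page, the Hoeffding bound referred to above is commonly stated as ε = sqrt(R² ln(1/δ) / (2n)). The sketch below shows how such a bound might be evaluated in a streaming setting; the function name, parameters and the acceptance test described in the comments are illustrative assumptions, not the paper's implementation.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability at least 1 - delta, the observed mean of n i.i.d.
    samples of a quantity with range `value_range` deviates from the true
    mean by at most the returned epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Illustrative use: a rule-quality measure in [0, 1], 99% confidence,
# 500 examples observed so far.
epsilon = hoeffding_bound(value_range=1.0, delta=0.01, n=500)

# A stream learner would typically accept the current best rule once its
# observed quality exceeds that of the runner-up by more than epsilon, i.e.
# the rule induced from this smaller sample is judged to be as good as one
# induced from a much larger sample.
print(f"epsilon = {epsilon:.4f}")
```

The key property is that ε shrinks as the number of observed examples n grows, which is what allows a decision to be made early on a smaller sample with a controlled risk of error.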

Highlights

  • One problem the research area of ‘Big Data Analytics’ is concerned with is the analysis of high velocity data, known as streaming data [1, 2], that challenge our computational resources

  • The research presented in this paper is motivated by the fact that rule-based data stream classification models are more expressive than other models, such as decision tree models, instance-based models and probabilistic models

  • Inducing a classifier on data streams has some unique challenges compared with data mining from batch data, as the pattern encoded in the stream may change over time, which is known as concept drift


Summary

Introduction

One problem the research area of ‘Big Data Analytics’ is concerned with is the analysis of high velocity data, known as streaming data [1, 2], that challenge our computational resources. Accuracy has been the dominant measure of interest when comparing classifiers in both static and streaming environments, yet real-time decision making based on streaming models still suffers from a lack of trust [17]. To address this issue, the user can specify an accuracy loss band (ζ): the chosen model must be expressive enough to earn trust, while its accuracy is tolerated at up to ζ% below that of the best performing, less expressive classifier (which can be a total black box).
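To make the accuracy loss band concrete, the following sketch selects the most expressive candidate whose accuracy is within ζ of the best performer; the model names, scores, expressiveness scale and selection function are hypothetical illustrations, not taken from the paper.

```python
def select_expressive_model(candidates, zeta):
    """Pick the most expressive model whose accuracy is within `zeta` of the
    best-performing candidate (which may itself be a black box).

    `candidates` holds (name, accuracy, expressiveness) tuples; the names,
    scores and expressiveness scale below are illustrative assumptions only.
    """
    best_accuracy = max(accuracy for _, accuracy, _ in candidates)
    tolerated = [c for c in candidates if c[1] >= best_accuracy - zeta]
    return max(tolerated, key=lambda c: c[2])

models = [
    ("HoeffdingTree", 0.91, 2),   # accurate but harder to interpret
    ("HoeffdingRules", 0.89, 5),  # slightly less accurate, far more expressive
    ("NaiveBayes", 0.87, 3),
]
print(select_expressive_model(models, zeta=0.03))  # -> ('HoeffdingRules', 0.89, 5)
```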

Related Work
Hoeffding Rules
Probability Density Distribution for Expressive Continuous Rule Terms
Using the Hoeffding Bound to Ensure Quality of Learnt Rules from a Smaller Sample
Overall Learning Process of Hoeffding Rules
Experimental Evaluation and Discussion
Datasets
Abstaining from Classification
Conclusions
