Abstract

The world is constantly changing, and so is the massive amount of data it produces. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy, to be prepended to any streaming machine learning classification algorithm, that oversamples the minority class using a new version of SMOTE and Borderline-SMOTE inspired by data sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performances are better than those of the state of the art. Moreover, we analyze the time/memory consumption and the concept drift recovery speed.
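To make the idea of a meta-strategy placed in front of a streaming classifier more concrete, here is a minimal sketch of classic SMOTE interpolation applied to a sliding window of recent minority samples. It is not the authors' VFC-SMOTE, which relies on data sketching; the class `WindowedSmote`, its parameters (`window`, `k`, `seed`) and the `learn_one(x, y)` callback are hypothetical names chosen for illustration, and the rebalancing rule (generate synthetic samples until the minority count catches up) is an assumption.

```python
import random
from collections import deque


def k_nearest(x, pool, k):
    """Return up to k points of pool closest to x (squared Euclidean distance)."""
    return sorted(pool, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]


def interpolate(x, neighbor, gap):
    """Classic SMOTE step: a point on the segment between x and neighbor."""
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


class WindowedSmote:
    """Hypothetical wrapper that rebalances a stream before it reaches a learner."""

    def __init__(self, learn_one, minority_label=1, window=50, k=3, seed=0):
        self.learn_one = learn_one          # assumed callback: learn_one(x, y)
        self.minority_label = minority_label
        self.recent = deque(maxlen=window)  # recent real minority samples only
        self.k = k
        self.rng = random.Random(seed)
        self.n_minority = 0                 # real + synthetic minority samples
        self.n_majority = 0

    def process(self, x, y):
        # Always forward the real sample first.
        self.learn_one(x, y)
        if y == self.minority_label:
            self.recent.append(x)
            self.n_minority += 1
        else:
            self.n_majority += 1
        # Assumed rule: add synthetic minority samples until classes are balanced.
        while self.n_minority < self.n_majority and len(self.recent) >= 2:
            pool = list(self.recent)
            base = self.rng.choice(pool)
            others = [p for p in pool if p is not base]
            neighbor = self.rng.choice(k_nearest(base, others, self.k))
            synthetic = interpolate(base, neighbor, self.rng.random())
            self.learn_one(synthetic, self.minority_label)
            self.n_minority += 1


# Toy stream (made up for illustration): one minority sample every six arrivals.
seen = []  # stands in for the training calls of a streaming classifier
sampler = WindowedSmote(lambda x, y: seen.append((x, y)))
for i in range(30):
    y = 1 if i % 6 == 0 else 0
    sampler.process([float(i), float(y)], y)
```

Because the wrapper only needs a `learn_one`-style callback, the same sketch could sit in front of any incremental classifier; the window bounds memory, which is the constraint that motivates sketch-based variants.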

Highlights

  • Data abound as a multitude of smart devices produce massive, continuous, unbounded and non-stationary flows of data, namely data streams

  • We compare the Streaming Machine Learning (SML)+ methods to the ARF, HAT, Naïve Bayes (NV), K-Nearest Neighbor (KNN), and SWT algorithms preceded by the VFC-SMOTE meta-strategy in terms of statistical tests, time and memory consumption, and speed of recovery from concept drift

  • We keep synthetic and real data streams separate and carry out a Nemenyi test (Demsar 2006) with significance level α = 0.05 to compare the SML+ model performances with those achieved by VFC-SMOTE*
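As a rough illustration of how a Nemenyi comparison at α = 0.05 works, the sketch below computes average ranks across data streams and the critical difference CD = q_α · sqrt(k(k+1)/(6N)) for k methods and N streams. The function names are hypothetical, and the q values are the two-tailed Nemenyi critical values tabulated in Demsar (2006) for α = 0.05; no numbers here come from the paper's experiments.

```python
import math

# Nemenyi critical values q_0.05 for k = 2..10 methods, as tabulated in Demsar (2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def ranks_desc(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def average_ranks(score_table):
    """score_table[d][m] is the score of method m on data stream d."""
    n = len(score_table)
    k = len(score_table[0])
    totals = [0.0] * k
    for row in score_table:
        for m, r in enumerate(ranks_desc(row)):
            totals[m] += r
    return [t / n for t in totals]


def critical_difference(k, n):
    """Two methods differ significantly if their average ranks differ by more than this."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n))
```

For example, with 5 methods compared on 10 data streams, `critical_difference(5, 10)` is about 1.93, so two methods would be declared different only if their average ranks are more than roughly 1.93 apart.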



Introduction

Data abound as a multitude of smart devices produce massive, continuous, unbounded and non-stationary flows of data, namely data streams. Data streams are different from the batches of data used to train traditional Machine Learning (ML) models. In the batch setting, all the observations are known in advance, so it is possible to iterate over them multiple times, to split them into training and testing sets, or to inspect their characteristics, e.g. the class imbalance ratio. In the case of data streams, new samples arrive unceasingly over time, as mini-batches or even one at a time. It is impossible to iterate over data streams multiple times, to split them into training and testing sets, or to inspect their characteristics. The traditional/batch-oriented ML techniques cannot be used

Responsible editor: Annalisa Appice, Sergio Escalera, Jose A. Della Valle
