VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams

Alessio Bernardo, Emanuele Della Valle

doi:10.1007/s10618-021-00786-0

Journal: Data Mining and Knowledge Discovery
Publication Date: Sep 14, 2021
Citations: 12
License type: open-access

Affiliation: Politecnico di Milano

Abstract

The world is constantly changing, and so is the massive amount of data it produces. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm. It oversamples the minority class using a new version of SMOTE and Borderline-SMOTE inspired by data sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performances are better than the state of the art. Moreover, we analyze the time and memory consumption and the concept drift recovery speed.
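
For readers unfamiliar with the oversampling step the abstract refers to, the following is a minimal sketch of classic SMOTE interpolation, not of VFC-SMOTE itself: the sketch-based variant proposed in the paper is not described in this excerpt. The function name `smote_sample`, its parameters, and the toy data are assumptions made for illustration only.

```python
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority point by classic SMOTE interpolation.

    Illustrative only: this is the batch-style idea that VFC-SMOTE builds on,
    not the streaming, sketch-based procedure from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random()
    # Pick a random minority sample as the base point.
    i = rng.randrange(len(minority))
    base = minority[i]
    # Rank the remaining minority samples by squared Euclidean distance.
    others = [p for j, p in enumerate(minority) if j != i]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    # Choose one of the k nearest neighbours and interpolate towards it.
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # uniform in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


# Hypothetical two-dimensional minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
```

The synthetic point always lies on the segment between a real minority sample and one of its nearest minority neighbours, which is why it stays inside the region already occupied by the minority class.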

Highlights

Data abound as a multitude of smart devices produce massive, continuous, unbounded and non-stationary flows of data, namely data streams
We compare the Streaming Machine Learning (SML)+ methods to the ARF, HAT, Naïve Bayes (NV), K-Nearest Neighbor (KNN), and SWT algorithms prepended by the VFC-SMOTE meta-strategy in terms of statistical tests, time and memory consumed, and recovery speed after a concept drift occurs
We keep the synthetic and real data streams separate and carry out a Nemenyi test (Demsar 2006) with significance level α = 0.05 to compare the performances of the SML+ models with those achieved by VFC-SMOTE*
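
As a rough illustration of how a Nemenyi test at α = 0.05 separates models, the sketch below computes the critical difference (CD) between average ranks: two models differ significantly when their average ranks differ by more than the CD. The critical values are the standard q_0.05 constants tabulated in Demsar (2006); the numbers of models and data streams in the example, and the function names, are hypothetical and not taken from the paper.

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared models (as tabulated in Demsar 2006).
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def nemenyi_cd(n_models, n_datasets):
    """Critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    q = Q_ALPHA_005[n_models]
    return q * math.sqrt(n_models * (n_models + 1) / (6.0 * n_datasets))


def significantly_different(avg_rank_a, avg_rank_b, n_models, n_datasets):
    """True when the average-rank gap exceeds the critical difference."""
    return abs(avg_rank_a - avg_rank_b) > nemenyi_cd(n_models, n_datasets)


# Hypothetical setting: 5 models ranked on 30 data streams.
cd = nemenyi_cd(5, 30)
```

With these assumed numbers the critical difference is about 1.11, so a model with average rank 1.8 would be significantly better than one with average rank 3.2, but not than one with average rank 2.5.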

Summary

Introduction

Data abound as a multitude of smart devices produce massive, continuous, unbounded and non-stationary flows of data, namely data streams. Data streams are different from the batches of data used to train traditional Machine Learning (ML) models. With batches, since all the observations are known in advance, it is possible to iterate over them multiple times, to split them into training and testing sets, or to inspect their characteristics, e.g. class imbalance ratio. In the case of data streams, new samples arrive unceasingly over time as mini-batches or even only one at a time. It is impossible to iterate over data streams multiple times, to split them into training and testing sets, or to inspect their characteristics. The traditional/batch-oriented ML techniques cannot be used
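
The single-pass constraint described above is commonly handled with a test-then-train (prequential) loop, in which each sample is first used for prediction and then for learning before being discarded. The following is a minimal sketch under that assumption; the `MajorityClassLearner` class and the toy stream are hypothetical stand-ins and are not part of the paper's experimental setup.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: always predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first label has been seen.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train evaluation: each sample is seen exactly once."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test on the new sample first
            correct += 1
        total += 1
        learner.learn_one(x, y)      # then learn from it and move on
    return correct / total if total else 0.0


# Hypothetical stream consumed through an iterator, so it cannot be replayed.
samples = [((0.1,), 0), ((0.4,), 0), ((0.9,), 1), ((0.2,), 0)]
learner = MajorityClassLearner()
accuracy = prequential_accuracy(iter(samples), learner)
```

Because the learner only keeps running label counts, the class imbalance ratio is not known in advance and can only be estimated as samples arrive.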

Similar Papers

An extensive study of C-SMOTE, a Continuous Synthetic Minority Oversampling Technique for Evolving Data Streams
Alessio Bernardo ... Emanuele Della Valle
Expert Systems with Applications | VOL. 196
12 Feb 2022

SMOTE-OB: Combining SMOTE and Online Bagging for Continuous Rebalancing of Evolving Data Streams
Alessio Bernardo ... Emanuele Della Valle
15 Dec 2021

C-SMOTE: Continuous Synthetic Minority Oversampling for Evolving Data Streams
Alessio Bernardo ... Albert Bifet
10 Dec 2020

A multi-objective ensemble method for online class imbalance learning
Shuo Wang ... Leandro L Minku
01 Jul 2014
