Abstract
The presence of imbalanced classes is increasingly common in practical applications, and it is known to heavily compromise the learning process. In this paper we propose a new method for addressing this issue in binary supervised classification. Re-balancing the class sizes has proved to be a fruitful strategy for overcoming this problem; our proposal performs re-balancing through matrix sketching. Matrix sketching is a recently developed data compression technique characterized by the property of preserving most of the linear information present in the data. This property is guaranteed by the Johnson–Lindenstrauss Lemma (1984), which makes it possible to embed an n-dimensional space into a reduced one while distorting the distance between any pair of points by at most a factor of (1 ± ε). We propose matrix sketching as an alternative to the standard re-balancing strategies based on randomly under-sampling the majority class or randomly over-sampling the minority one. We assess the properties of our method when combined with linear discriminant analysis (LDA), classification trees (C4.5) and Support Vector Machines (SVM) on simulated and real data. Results show that sketching can be a sound alternative to the most widely used re-balancing methods.
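The distance-preservation property can be illustrated numerically. The sketch below is not the authors' implementation; it assumes a Gaussian projection matrix with entries N(0, 1/k), one common construction satisfying the Johnson–Lindenstrauss guarantee, and checks how much the pairwise distances of a random point cloud are distorted after projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 points in a d = 1000 dimensional space, embedded into k = 400 dimensions.
n_points, d, k = 50, 1000, 400
X = rng.normal(size=(n_points, d))

# Gaussian JL projection: entries N(0, 1/k), so squared norms are preserved
# in expectation.
P = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))
Y = X @ P.T

def pairwise_dist(A):
    """All pairwise Euclidean distances of the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D_orig, D_proj = pairwise_dist(X), pairwise_dist(Y)
iu = np.triu_indices(n_points, 1)
ratios = D_proj[iu] / D_orig[iu]

# All distance ratios concentrate in a narrow (1 - eps, 1 + eps) interval.
print(ratios.min(), ratios.max())
```

With k chosen large enough relative to log(n_points), every ratio falls inside a narrow interval around 1, which is exactly the ε-distortion bound the abstract refers to.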
Highlights
In many practical contexts, observations have to be classified into two classes of markedly different sizes
Because the sketched data are obtained as random linear combinations of the original observations, most of the linear information is preserved after sketching. In the imbalanced-data case, the size of the majority class can therefore be reduced through sketching without the risk of losing linear information
Gaussian under-sketching combined with Linear Discriminant Analysis (LDA) performs significantly better than ROSE and random under-sampling (US), while its superiority over Adasyn, SMOTE and Bal-USOS is less evident; with respect to the imbalanced case and random over-sampling (OS), there is no relevant improvement in terms of area under the curve (AUC)
Summary
Observations often have to be classified into two classes of markedly different sizes. According to previous results in the literature (see, e.g., Domingos 1999; Branco et al. 2016), under-sampling the majority class leads to better classifier performance than over-sampling, and combining the two does not produce much improvement over simple under-sampling. The Synthetic Minority Over-sampling Technique (SMOTE) was instead designed to create synthetic minority examples rather than over-sample with replacement. We propose to address the imbalanced-class issue through matrix sketching, a recently developed data transformation technique. It allows the size of the majority class to be reduced, or the size of the minority class to be increased, while preserving the linear information present in the original data and performing data perturbation at the same time.
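The rebalancing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the class sizes, feature dimension, and the Gaussian sketching matrix with entries N(0, 1/m) are all hypothetical choices. The majority class is "under-sketched" down to the minority-class size, and the Gram (cross-product) matrix, the linear information a classifier such as LDA relies on, is approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical imbalanced binary problem: 2000 majority vs 100 minority rows.
X_maj = rng.normal(loc=0.0, size=(2000, 5))
X_min = rng.normal(loc=1.0, size=(100, 5))

def gaussian_under_sketch(X, m, rng):
    """Replace the n rows of X with m Gaussian random combinations.

    Entries of the sketching matrix are N(0, 1/m), so that
    E[(S X)^T (S X)] = X^T X: the linear information is kept in expectation.
    """
    n = X.shape[0]
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    return S @ X

# Under-sketch the majority class down to the minority-class size.
X_maj_sk = gaussian_under_sketch(X_maj, m=len(X_min), rng=rng)

# Balanced training set: 100 sketched majority + 100 original minority rows.
X_bal = np.vstack([X_maj_sk, X_min])
y_bal = np.r_[np.zeros(len(X_maj_sk)), np.ones(len(X_min))]
print(X_bal.shape)
```

The balanced set `(X_bal, y_bal)` can then be fed to any standard classifier; the sketched rows are pseudo-observations (random combinations of real ones), which also introduces the data perturbation mentioned above.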