Abstract

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC will be presented as a valuable instrument for countering both class imbalance and the well-known “curse of dimensionality” problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. The method will be compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.

Highlights

  • The between-class imbalance is a well-known problem that afflicts numerous datasets. The classification task becomes even more difficult when there are very few instances in the dataset, a few hundred for example, and when each instance is composed of thousands of dimensions

  • In [3], the authors develop the idea of SMOTE with SVM classifiers to deal with class imbalance problems; artificial minority-class instances are generated around the borderline between two data classes

  • It has been shown that LICIC with linear KPCA is useful when the number of dimensions is very high
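The SMOTE-style interpolation referenced in the highlights can be sketched as follows; this is a minimal illustration of the classic SMOTE idea (synthetic points on the line segment between a minority instance and one of its minority neighbours), not the authors' implementation, and the function name `smote_oversample` is our own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority instances: each one lies on the
    segment between a minority instance and one of its k nearest
    minority-class neighbours (plain SMOTE interpolation)."""
    rng = np.random.default_rng(rng)
    # k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority instance
        j = idx[i, rng.integers(1, k + 1)]     # one of its k neighbours
        gap = rng.uniform()                    # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Borderline-SMOTE and the SVM variant differ only in *which* minority instances are selected for interpolation (those near the class boundary), not in the interpolation step itself.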


Summary

Introduction

The between-class imbalance is a well-known problem that afflicts numerous datasets. The classification task becomes even more difficult when there are very few instances in the dataset, a few hundred for example, and when each instance is composed of thousands of dimensions. The proposed method, LICIC (Less Important Components for Imbalanced multiclass Classification), is designed to deal with datasets that have fewer instances than dimensions, and where there is a strong skewness between the number of instances of different classes. It operates in “feature space” rather than “data space”, preserving the non-linearities present in the dataset. It applies kernel PCA [6] to the whole dataset and works in the Φ(x)-transformed space, permuting the less important components to create new synthetic instances for each minority class.
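The pipeline described above (kernel PCA on the whole dataset, permutation of less important components between same-class instances, pre-image back to data space) can be sketched as below. This is our reading of the description, not the authors' code: the function name `licic_like_oversample`, the use of scikit-learn's `KernelPCA` with its built-in pre-image computation, and the choice of `n_keep` are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def licic_like_oversample(X, y, minority_label, n_new, n_keep, rng=None):
    """LICIC-style oversampling sketch: fit kernel PCA on the whole dataset,
    keep each minority instance's n_keep most important components, swap its
    less important components with those of another same-class instance,
    then map the result back to data space via the pre-image."""
    rng = np.random.default_rng(rng)
    # fit_inverse_transform=True makes KernelPCA learn a pre-image map
    kpca = KernelPCA(kernel="linear", fit_inverse_transform=True)
    Z = kpca.fit_transform(X)               # whole dataset, Phi(x) space
    Z_min = Z[y == minority_label]
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(Z_min), size=2, replace=False)
        z = Z_min[i].copy()
        z[n_keep:] = Z_min[j][n_keep:]      # permute less important components
        synthetic.append(z)
    # pre-image: back from feature space to the original data space
    return kpca.inverse_transform(np.array(synthetic))
```

Because the important components of each instance are untouched, the synthetic points stay close to the minority-class structure captured by the kernel, which is how the method avoids injecting noise.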

Literature Review
Dataset Description
Kernel Principal Components Analysis with Pre-Image Computation
Linear
LICIC Algorithm
Experiments and Results
MicroMass Dataset Results
F1-Micro
Learning
GCM Dataset Results
Conclusions