Abstract
Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because the data are imbalanced and the number of dimensions is high relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known “curse of dimensionality” problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. It is compared with other oversampling methods: Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
Highlights
The between-class imbalance is a well-known problem that afflicts numerous datasets. The classification task becomes even more difficult when the dataset contains very few instances, a few hundred for example, and each instance is composed of thousands of dimensions.
In [3], the authors extend the idea of SMOTE to SVM classifiers to deal with class imbalance problems; artificial minority class instances are generated around the borderline between two classes.
It has been shown that LICIC with linear KPCA is useful when the number of dimensions is very high.
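For orientation, the interpolation step that SMOTE-style baselines share can be sketched as follows. This is a minimal illustration of plain SMOTE, not the borderline variant of [3], which additionally restricts the base instances to those near the class boundary. The function name `smote_like` and its parameters are hypothetical and chosen for this example only.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating minority neighbours.

    X_min is a float array of shape (n_instances, n_dimensions) holding
    the instances of ONE minority class (hypothetical helper).
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    if k < 1:
        raise ValueError("need at least two minority instances")
    # Pairwise squared Euclidean distances; a point is not its own neighbour.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]
    # Pick a base instance and one of its k nearest neighbours at random.
    base = rng.integers(n, size=n_new)
    chosen = neighbours[base, rng.integers(k, size=n_new)]
    # Place the synthetic point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Because every synthetic point lies on a straight segment in the original data space, this kind of interpolation works directly on the raw dimensions, which is the setting the highlights contrast LICIC against.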
Summary
The between-class imbalance is a well-known problem that afflicts numerous datasets. The classification task becomes even more difficult when the dataset contains very few instances, a few hundred for example, and each instance is composed of thousands of dimensions. LICIC (Less Important Components for Imbalanced multiclass Classification) is designed to deal with datasets that have fewer instances than dimensions and a strong skew between the numbers of instances of the different classes. It operates in “feature space” rather than “data space”, preserving non-linearities present in the datasets. It applies kernel PCA [6] to the whole dataset and works in the Φ(x) transformed space, permuting the less important components to create new synthetic instances for each minority class.
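The idea described above can be illustrated with a rough sketch. It is not the authors' implementation: it assumes the linear-kernel case only, where kernel PCA coincides with ordinary PCA and the mapping back to the original dimensions is exact (a non-linear kernel would need an approximate pre-image step). The function name `licic_like_oversample`, the parameter `n_keep`, and the rule for drawing the trailing component values from random same-class instances are all assumptions made for this example.

```python
import numpy as np

def licic_like_oversample(X, y, n_keep=2, seed=0):
    """Balance classes by recombining less important components (sketch).

    X: float array (n_instances, n_dimensions); y: class labels.
    n_keep: number of leading components kept from the base instance
            (hypothetical parameter, not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    # PCA on the WHOLE dataset via SVD (equivalent to linear-kernel KPCA).
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Z = (X - mean) @ Vt.T  # component scores, most important first
    cols = np.arange(n_keep, Z.shape[1])  # the "less important" components

    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    new_Z, new_y = [], []
    for c, n in zip(classes, counts):
        Zc = Z[y == c]
        for _ in range(target - n):
            # Start from one real instance of the minority class ...
            synth = Zc[rng.integers(len(Zc))].copy()
            # ... and take each trailing component from a random
            # instance of the same class (assumed permutation rule).
            donors = rng.integers(len(Zc), size=len(cols))
            synth[cols] = Zc[donors, cols]
            new_Z.append(synth)
            new_y.append(c)
    if not new_Z:
        return X.copy(), y.copy()
    # Map the synthetic scores back to the original dimensions.
    X_new = np.asarray(new_Z) @ Vt + mean
    return np.vstack([X, X_new]), np.concatenate([y, np.asarray(new_y)])
```

In this sketch every synthetic instance keeps the leading components of a real minority instance, so it stays close to that class in the directions of highest variance, while the recombined trailing components supply variation without interpolating in the raw data space.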