Abstract
Imbalanced classification is one of the most important problems in machine learning and data mining and arises in many real-world datasets. In the past, basic classifiers such as SVM and KNN have been applied to imbalanced datasets, in which the number of samples in one class is much larger than in the other, but their classification performance is not ideal. Several data preprocessing methods have been proposed to reduce the imbalance ratio of a dataset and are combined with basic classifiers to obtain better performance. To improve the overall classification accuracy, we propose a novel classifier ensemble framework based on K-means and a resampling technique (EKR). First, the samples of the majority class are divided into several sub-clusters using K-means, where the value of k is determined by the average silhouette coefficient. The number of samples in each sub-cluster is then adjusted, through resampling, to match the size of the minority class. Each adjusted sub-cluster is combined with the minority class to form a balanced subset, a base classifier is trained on each balanced subset separately, and the base classifiers are finally integrated into a strong ensemble classifier. Extensive experimental results on 16 imbalanced datasets demonstrate the effectiveness and feasibility of the proposed algorithm under multiple evaluation criteria, and EKR achieves better performance than several classical imbalanced classification algorithms that use different data preprocessing methods.
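A minimal sketch of the training stage described above, assuming scikit-learn as the toolkit. The function names (`choose_k`, `fit_ekr`), the candidate range for k, the decision-tree base learner, and the use of simple random resampling within each cluster are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the EKR training procedure (not the authors' code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def choose_k(X_maj, k_range=range(2, 11)):
    """Pick k for K-means by the average silhouette coefficient."""
    best_k, best_score = 2, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_maj)
        score = silhouette_score(X_maj, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def fit_ekr(X_maj, X_min, base_estimator=DecisionTreeClassifier):
    """Train one base classifier per balanced subset (one sub-cluster + minority class)."""
    k = choose_k(X_maj)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_maj)
    ensemble = []
    for c in range(k):
        cluster = X_maj[labels == c]
        # Resample the sub-cluster so it has as many samples as the minority class.
        cluster = resample(cluster, n_samples=len(X_min), random_state=0)
        X = np.vstack([cluster, X_min])
        y = np.hstack([np.zeros(len(cluster)), np.ones(len(X_min))])  # 1 = minority (positive) class
        ensemble.append(base_estimator().fit(X, y))
    return ensemble
```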
Highlights
Imbalanced classification has been a research hotspot in pattern recognition, machine learning, and data mining in recent years [1] and has attracted widespread attention from researchers. For binary classification, an imbalanced dataset contains two classes of samples: the class with a large number of samples is called the majority class or negative class, while the class with a small number of samples is called the minority class or positive class.
To ensure the fairness of the results, this paper adopts five-fold cross-validation: each dataset is divided into five equal parts, with 80% of the data used as the training set and the remaining 20% as the test set, and the average of ten experimental results is taken as the final result.
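An illustrative sketch of this evaluation protocol, assuming scikit-learn's StratifiedKFold and the F1 score as a stand-in metric; `fit_fn`, `predict_fn`, and the number of repeated runs are hypothetical placeholders rather than the paper's exact setup.

```python
# Stratified five-fold cross-validation, repeated twice (2 runs x 5 folds = 10 results),
# with the mean score reported as the final result.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score


def evaluate(X, y, fit_fn, predict_fn, n_runs=2):
    scores = []
    for run in range(n_runs):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
        for train_idx, test_idx in skf.split(X, y):
            model = fit_fn(X[train_idx], y[train_idx])
            scores.append(f1_score(y[test_idx], predict_fn(model, X[test_idx])))
    return np.mean(scores)
```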
As a research hotspot in machine learning, imbalanced classification has attracted the attention of many scholars
Summary
Imbalanced classification has been a research hotspot in pattern recognition, machine learning, and data mining in recent years [1] and has attracted widespread attention from researchers. When dealing with imbalanced data, traditional base classifiers such as support vector machines, Naïve Bayes, and K-nearest neighbors aim to maximize overall classification accuracy and therefore tend to ignore the minority class, so its samples cannot be separated correctly, even though the minority class is often more important than the majority class because it contains more useful information. This problem is mainly addressed at two levels: the data level and the algorithm level. We propose a novel ensemble framework that applies clustering and resampling only to the majority class and retains all information of the minority class for subsequent training.
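A minimal sketch of how the per-subset base classifiers might be combined into the final ensemble decision, assuming simple averaging of positive-class probabilities; the paper does not specify the combination rule here, so this aggregation choice is an assumption.

```python
# Hypothetical aggregation step: average the base classifiers' probabilities
# for the minority (positive) class and threshold the result.
import numpy as np


def predict_ekr(ensemble, X, threshold=0.5):
    """Combine base classifiers trained on the balanced subsets (see fit_ekr above)."""
    probs = np.mean([clf.predict_proba(X)[:, 1] for clf in ensemble], axis=0)
    return (probs >= threshold).astype(int)
```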