A Decoupling and Bidirectional Resampling Method for Multilabel Classification of Imbalanced Data with Label Concurrence

Shuyue Zhou, Hao Xu, Xiaobo Li, Yihong Dong

doi:10.1155/2020/8829432

Open Access

https://doi.org/10.1155/2020/8829432

Abstract

Label imbalance is one of the characteristics of multilabel data, and imbalanced data seriously degrades classifier performance. In multilabel classification, resampling methods are the most common way to deal with imbalance. Existing resampling methods balance the data by either undersampling or oversampling, which causes information loss or overfitting, respectively. Resampling also has a significant impact on the minority labels. Furthermore, the high concurrency of majority labels and minority labels in many instances affects classification performance. In this study, we propose a bidirectional resampling method that decouples multilabel datasets. On one hand, the concurrency of labels can be reduced by setting termination conditions for decoupling; on the other hand, the loss of instance information and overfitting can be alleviated by combining oversampling and undersampling. By measuring the minority labels of the instances, the instances that have less impact on minority labels are selected for resampling. The number of resampled instances is limited to preserve the original distribution of the data during the resampling phase. Experiments on seven benchmark multilabel datasets demonstrate the effectiveness of the algorithm, especially on datasets with high concurrency of majority labels and minority labels.

Highlights

With the advent of the era of big data, data classification has received much attention in recent years. Imbalanced data often occurs in the field of data classification, including medical data.
In the field of tumor classification, nontumor patients are the majority class, while tumor patients are the minority class [1], but we are more concerned about the minority of tumor patients. These problems exist in fields such as medical imaging classification, credit card fraud [2] detection, and network intrusion identification.
The Multilabel Decoupling Bidirectional Resampling algorithm (ML-DBR) calculates the SCUMBLEIns value for each instance in the dataset, sets the initial SCUMBLE(D) of the dataset as SCUMBLE(D)1, and decouples the instances that meet the requirement according to SCUMBLE(D)1, so as to reduce the number of instances with highly concurrent labels. If SCUMBLEIns(i) > SCUMBLE(D)1, the instance Di is cloned as D′i; with Li the label set of Di and L′i the label set of D′i, L′i = Li[IR(y) ≥ MeanIR] and Li = Li[IR(y) ≤ MeanIR], where IR(y) is the Imbalance Ratio per Label and MeanIR is the Mean Imbalance Ratio. Then, each time 1% of the instances in the dataset have been decoupled, the SCUMBLE(D) of the decoupled dataset is recalculated.
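
The decoupling step in the last highlight can be illustrated with a small sketch. It is not the authors' implementation. It assumes the definitions commonly used in the multilabel literature: IR(y) is the count of the most frequent label divided by the count of label y, MeanIR is the average of IR(y) over all labels, SCUMBLEIns(i) is one minus the ratio of the geometric mean to the arithmetic mean of IR(y) over the labels active in instance i, and SCUMBLE(D) is the average of SCUMBLEIns over all instances. The function names and the toy label sets are hypothetical, only label sets are handled (features would be copied along with each clone), and the periodic recalculation of SCUMBLE(D) is left out because the text does not spell out what follows it.

```python
from math import prod


def imbalance_ratios(label_sets, labels):
    """IR(y) = count of the most frequent label / count of label y (assumed definition)."""
    counts = {y: sum(y in s for s in label_sets) for y in labels}
    top = max(counts.values())
    # Labels that never occur are skipped to avoid division by zero.
    return {y: top / c for y, c in counts.items() if c > 0}


def scumble_ins(label_set, ir):
    """1 - geometric mean / arithmetic mean of IR over the active labels (assumed definition)."""
    vals = [ir[y] for y in label_set if y in ir]
    if not vals:
        return 0.0
    geometric = prod(vals) ** (1.0 / len(vals))
    arithmetic = sum(vals) / len(vals)
    return 1.0 - geometric / arithmetic


def decouple_once(label_sets, labels):
    """One decoupling pass: split every instance whose SCUMBLEIns exceeds SCUMBLE(D)1."""
    ir = imbalance_ratios(label_sets, labels)
    mean_ir = sum(ir.values()) / len(ir)
    scores = [scumble_ins(s, ir) for s in label_sets]
    threshold = sum(scores) / len(scores)  # SCUMBLE(D)1, the initial dataset-level value
    result = []
    for s, score in zip(label_sets, scores):
        if score > threshold:
            # Li keeps the labels with IR(y) <= MeanIR (majority side).
            result.append({y for y in s if ir[y] <= mean_ir})
            # The clone L'i keeps the labels with IR(y) >= MeanIR (minority side).
            result.append({y for y in s if ir[y] >= mean_ir})
        else:
            result.append(set(s))
    return result, threshold


# Toy data (hypothetical): "a" is frequent, "b" is moderate, "c" is rare.
toy = [{"a"}, {"a"}, {"a", "b"}, {"a", "b"}, {"a", "c"}, {"b"}]
decoupled, scumble_d1 = decouple_once(toy, ["a", "b", "c"])
print(len(decoupled))  # 7: only {"a", "c"} is split, into {"a"} and {"c"}
```

In this toy dataset only the instance that pairs the most frequent label with the rarest one exceeds the threshold, so it alone is split; the {"a", "b"} instances stay intact. An instance whose labels all fall on one side of MeanIR would produce an empty copy, a case the sketch does not handle.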

Summary

Research Article

Label imbalance is one of the characteristics of multilabel data, and imbalanced data seriously affects the performance of classifiers. Existing resampling methods balance the data by either undersampling or oversampling, which causes information loss or overfitting. The high concurrency of majority labels and minority labels in many instances also affects classification performance. On one hand, the concurrency of labels can be reduced by setting termination conditions for decoupling; on the other hand, the loss of instance information and overfitting can be alleviated by combining oversampling and undersampling. By measuring the minority labels of the instances, the instances that have less impact on minority labels are selected for resampling. The number of resampled instances is limited to keep the original distribution of the data during the resampling phase. Experiments on seven benchmark multilabel datasets demonstrate the effectiveness of the algorithm, especially on datasets with high concurrency of majority labels and minority labels.

