Skewed class distributions and data complexity can severely degrade imbalanced classification results. The cost of classification can be significantly reduced if data complexity is measured and addressed through pre-processing prior to training, particularly when dealing with large-scale and high-dimensional datasets. Although many methods have been proposed to evaluate data complexity, most of them fail to fully consider the interaction among different data characteristics, or the connection between class imbalance and these characteristics, which makes it difficult to evaluate classification difficulty effectively. This paper presents a new data complexity measure, MFII (multi-factor imbalance index), which measures the combined effects of the skewed class distribution and data characteristics on classification difficulty. In particular, it investigates the impact of overlap size, confusion degree, and sub-cluster structure. VoR (value of resolution) and DoC (degree of consistency) are also proposed to evaluate the resolution and interpretability of complexity measures. The experimental results demonstrate that MFII has a lower VoR and a stronger correlation with classification metrics, which indicates that MFII can more accurately evaluate the difficulty of multi-class imbalanced classification tasks.