Study of selected methods for balancing independent data sets in k-nearest neighbors classifiers with Pawlak conflict analysis

Małgorzata Przybyła-Kasperek

doi:10.1016/j.asoc.2022.109612

Abstract

The article addresses classification based on independent data sets. Specifically, it investigates how different methods for balancing data sets affect classification quality in an approach that uses Pawlak conflict analysis. The proposed method applies algorithms for imbalanced classification separately to each of the fragmented sets. The following algorithms are considered: SMOTE, random over-sampling, Tomek links, Near Miss, random under-sampling, and a combination of SMOTE and Tomek links. Conflict analysis is then applied to the balanced data sets (decision tables), and coalitions of tables are created. An aggregated table is defined for each coalition, and a modified k-nearest neighbors algorithm is used to determine the decision vectors. Majority voting is used to fuse the decision vectors. Experimental results show that, in most cases, the proposed approach gives much better results than the same approach without methods for imbalanced data. In addition, the proposed approach achieves better results than other methods known from the literature for dispersed data. For dispersed and independent data, the best results are obtained with over-sampling, especially SMOTE, random over-sampling, and the combination of SMOTE and Tomek links.
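The core idea of balancing each local table separately and then fusing the local decisions by majority voting can be illustrated with a minimal sketch. It is a simplified illustration under stated assumptions, not the method evaluated in the article: it uses only random over-sampling, a plain k-nearest neighbors classifier instead of the modified one, single labels instead of decision vectors, and it omits Pawlak conflict analysis, coalitions and aggregated tables entirely. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def knn_predict(rows, labels, query, k=3):
    """Plain k-NN with squared Euclidean distance (assumption: not the modified variant)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))
    nearest = sorted(zip(rows, labels), key=lambda pair: sq_dist(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_vote(predictions):
    """Return the most frequent label among the local predictions."""
    return Counter(predictions).most_common(1)[0][0]


def classify(local_tables, obj, k=3):
    """Balance each local table on its own, classify locally, then fuse by voting."""
    votes = []
    for attrs, rows, labels in local_tables:
        bal_rows, bal_labels = random_oversample(rows, labels)
        # Each local table sees only its own subset of the object's attributes.
        query = tuple(obj[i] for i in attrs)
        votes.append(knn_predict(bal_rows, bal_labels, query, k))
    return majority_vote(votes)


# Hypothetical toy data: (attribute indices, rows, labels) per local table.
LOCAL_TABLES = [
    ((0, 1), [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (1.0, 1.0)],
     ["neg", "neg", "neg", "neg", "pos"]),
    ((1, 2), [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (0.9, 1.0)],
     ["neg", "neg", "neg", "pos"]),
    ((0, 2), [(0.0, 0.0), (0.1, 0.0), (1.0, 0.9), (0.9, 1.0)],
     ["neg", "neg", "pos", "pos"]),
]

decision = classify(LOCAL_TABLES, (0.9, 0.9, 0.9))
```

On this toy data the first table contains a single "pos" row, so a 3-nearest-neighbors vote on the unbalanced table would be dominated by "neg" rows even for an object lying close to the "pos" example. After over-sampling, the duplicated "pos" rows fill the neighborhood and the local vote changes, which is the kind of effect the balancing step is meant to produce.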

Full Text