A novel intuitionistic fuzzy rough instance selection and attribute reduction with kernelized intuitionistic fuzzy C-means clustering to handle imbalanced datasets

Anoop Kumar Tiwari, Abhigyan Nath, Rakesh Kumar Pandey, Priti Maratha

doi:10.1016/j.eswa.2024.124087

Abstract

Owing to advances in internet and lab-based technologies, large volumes of high-dimensional data are generated every day. These data usually suffer from several issues such as class imbalance, noise, uncertainty, irrelevant and/or redundant features, and redundancy in size. These issues degrade the overall performance of machine learning algorithms. An effective way to cope with such issues in large datasets is to apply efficient data reduction techniques. In recent years, numerous data reduction techniques based on fuzzy rough set theory have been presented to tackle these obstacles. However, building a method that can handle all of the above-mentioned issues simultaneously is still a challenging task. In this study, we handle all these obstacles simultaneously by introducing a novel approach that obtains the reduced dataset by combining kernelized intuitionistic fuzzy C-means with an intuitionistic fuzzy rough set model. The intuitionistic fuzzy rough set handles vagueness and uncertainty better than fuzzy set aided models, as it makes use of membership, non-membership, and hesitancy to capture the uncertainty of real-valued datasets. To generate the membership and non-membership grades, a notion based on kernelized intuitionistic fuzzy C-means is introduced. Further, an intuitionistic fuzzy rough set model is established by defining lower and upper approximations based on a novel similarity relation. Moreover, all the necessary conditions are justified with the relevant mathematical theorems. Next, this model is employed for dimensionality reduction based on the concept of the discernibility matrix, using the idea of different classes’ ratios to avoid the noise. Thereafter, the positive region is defined with the help of the lower approximation.
The positive region information is applied to tackle problematic instances in both the minority and majority classes of the imbalanced dataset after artificial samples have been generated by the synthetic minority oversampling technique (SMOTE). A comprehensive experimental study is presented to show the effectiveness of the proposed technique. Finally, a framework is established to improve the prediction of animal toxin peptides.
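
To make the membership, non-membership, and hesitancy grades mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian kernel, the standard kernelized fuzzy C-means membership formula with fixed cluster centers (no iterative update), and a Yager-type generator for the non-membership grade. The function names and the parameters `m`, `sigma`, `alpha`, and `eps` are hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def gaussian_kernel(x, v, sigma):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / sigma^2) (assumed kernel)."""
    return np.exp(-np.sum((x - v) ** 2, axis=-1) / sigma ** 2)


def intuitionistic_grades(X, centers, m=2.0, sigma=1.0, alpha=0.5, eps=1e-12):
    """Return (mu, nu, pi) arrays of shape (n_samples, n_clusters).

    mu: membership from the kernelized fuzzy C-means formula.
    nu: non-membership from a Yager-type generator (an assumption here);
        alpha must lie in (0, 1] so that mu + nu <= 1.
    pi: hesitancy, defined as 1 - mu - nu.
    """
    # Kernel-induced distance 1 - K(x, v) between every sample and every center.
    d = 1.0 - gaussian_kernel(X[:, None, :], centers[None, :, :], sigma)
    d = np.maximum(d, eps)  # avoid division by zero when a sample sits on a center

    # Closer centers get larger weights; normalise so each row sums to 1.
    w = d ** (-1.0 / (m - 1.0))
    mu = w / w.sum(axis=1, keepdims=True)

    nu = (1.0 - mu ** alpha) ** (1.0 / alpha)
    pi = 1.0 - mu - nu
    return mu, nu, pi


if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [1.0, 1.0], [0.4, 0.5]])
    centers = np.array([[0.0, 0.0], [1.0, 1.0]])
    mu, nu, pi = intuitionistic_grades(X, centers)
    print(np.round(mu, 3))
    print(np.round(pi, 3))
```

In this sketch a sample lying close to a center receives a membership near 1 and a hesitancy near 0 for that cluster, while a sample lying between centers receives a larger hesitancy, which is the extra information an intuitionistic fuzzy model carries beyond a plain fuzzy membership.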

Full Text