Abstract

The class imbalance problem in data mining occurs when one class, known as the minority class, has significantly fewer samples than the other class(es), known as the majority class(es). It degrades the performance of machine learning algorithms by biasing them toward the majority class. This occurs because of sub-concepts within the minority class. Recent studies have further divided the minority class into four sub-concepts, Safe, Borderline, Rare and Outlier, using the majority-minority proportion in the neighborhood of every minority sample. Among these sub-concepts, Safe examples are easy to identify, while classifiers become increasingly inaccurate on the subsequent categories (Borderline, Rare, Outlier). Some recent studies use the heterogeneous value difference metric (HVDM) as the distance calculation mechanism for categorizing data. However, there are numerous other distance metrics whose effects on determining the sub-concepts have not yet been explored. This research evaluates the effects of different distance metrics on the calculation of the sub-concepts within the minority class data. We consider ten datasets and five distance metrics. The datasets are divided into three categories: fully categorical, mixed, and fully numeric data. For datasets with more categorical data, the outputs differ considerably between the distance functions. In those cases, the Euclidean and Manhattan distance functions assign examples to relatively safer categories. Our study shows that, for categorizing minority data, the distance metric should be chosen per dataset.
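
The neighborhood-based categorization described above can be illustrated with a minimal sketch. The abstract does not state the neighborhood size or the cut-offs, so the values here are assumptions: k = 5 neighbors, with a minority sample labeled Safe for 4 or 5 same-class neighbors, Borderline for 2 or 3, Rare for 1, and Outlier for 0, a scheme commonly used in the literature on minority example types. The function names, the toy data, and the inclusion of the Chebyshev distance are hypothetical choices for illustration and are not taken from the study.

```python
import math

K = 5  # assumed neighborhood size; the thresholds below only make sense for k = 5


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    # Illustrative extra metric; not named in the abstract.
    return max(abs(x - y) for x, y in zip(a, b))


def label_from_count(same_class_neighbors):
    """Map the number of minority neighbors (out of K = 5) to a sub-concept.

    The cut-offs are an assumption, not values stated in the abstract.
    """
    if same_class_neighbors >= 4:
        return "safe"
    if same_class_neighbors >= 2:
        return "borderline"
    if same_class_neighbors == 1:
        return "rare"
    return "outlier"


def categorize_minority(X, y, minority_label, distance):
    """Return {sample index: sub-concept} for every minority sample."""
    result = {}
    for i, xi in enumerate(X):
        if y[i] != minority_label:
            continue
        # Rank all other samples by distance and keep the K nearest.
        others = [j for j in range(len(X)) if j != i]
        neighbors = sorted(others, key=lambda j: distance(xi, X[j]))[:K]
        same = sum(1 for j in neighbors if y[j] == minority_label)
        result[i] = label_from_count(same)
    return result


# Toy data (hypothetical): label 1 is the minority class.
X = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (0, 4), (3, 3)]
y = [1, 0, 0, 0, 0, 0, 1]

for name, metric in [("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("chebyshev", chebyshev)]:
    print(name, categorize_minority(X, y, 1, metric))
```

On this toy data, the minority sample at the origin is labeled an outlier under the Euclidean and Manhattan distances but rare under the Chebyshev distance, because the fifth-nearest neighbor changes with the metric. HVDM, which handles categorical attributes, is not implemented here.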
