Abstract

Imbalanced class distribution in medical datasets is a challenging problem that hinders correct disease classification. It arises when the number of healthy-class instances is much larger than the number of disease-class instances. To address this problem, we propose undersampling the healthy-class instances to improve classification of the disease class. The model is named Hellinger Distance Undersampling (HDUS). It employs the Hellinger distance to measure the resemblance between a majority-class instance and its neighbouring minority-class instances, in order to separate the classes effectively and boost the discrimination power for each class. An extensive experiment was conducted on four imbalanced medical datasets using three classifiers to compare HDUS with a baseline model and three state-of-the-art undersampling models. The results show that HDUS can outperform the other models in terms of sensitivity, F1 measure, and balanced accuracy.
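To make the distance itself concrete, the following is a minimal sketch of the standard Hellinger distance between two discrete distributions, applied to one majority-class instance and a few minority-class neighbours. Several things here are assumptions for illustration and do not come from the paper: the helper names `hellinger` and `to_distribution`, the choice to turn a non-negative feature vector into a distribution by dividing by its sum, and the toy feature values. The abstract does not specify how HDUS converts these distances into a removal rule, so none is shown.

```python
import math


def to_distribution(x):
    """Scale a non-negative feature vector so its entries sum to 1.

    Assumption for illustration only: the abstract does not say how
    instances are mapped to distributions.
    """
    if any(v < 0 for v in x):
        raise ValueError("features must be non-negative")
    total = sum(x)
    if total <= 0:
        raise ValueError("feature vector must have a positive sum")
    return [v / total for v in x]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q.

    H(p, q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Toy data (hypothetical): one majority-class instance and two
# minority-class neighbours, each with three non-negative features.
majority_instance = [5.0, 1.0, 4.0]
minority_neighbours = [[4.0, 2.0, 4.0], [1.0, 8.0, 1.0]]

p = to_distribution(majority_instance)
distances = [hellinger(p, to_distribution(n)) for n in minority_neighbours]

# A smaller distance means the majority instance resembles that
# minority neighbour more closely.
for neighbour, d in zip(minority_neighbours, distances):
    print(neighbour, round(d, 4))
```

The distance is 0 for identical distributions and 1 for distributions with no overlapping support, which makes it a bounded measure of resemblance between instances.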

Highlights

  • Classification is a standard data mining process

  • To evaluate the proposed Hellinger Distance Undersampling (HDUS) method, we used four imbalanced medical datasets and three classification algorithms, namely decision tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbour (KNN), and compared the results with the baseline model and with three state-of-the-art undersampling methods: Tomek link, random undersampling (RUS), and edited nearest neighbour (ENN)

  • This paper proposes a novel model, HDUS, that handles the imbalanced classification problem in medical datasets to improve the classification of the minority disease class


Summary

Introduction

Classification is a standard data mining process. It consists of two steps: building a model and testing it. Most classification algorithms were designed for balanced datasets, so a problem arises when a dataset is imbalanced, because the imbalance degrades the recognition power of the classifier [1]. If the imbalanced class distribution is not addressed before the classification procedure is applied, the classifier tends to be biased towards the majority class and fails to classify minority class cases correctly [3]. Imbalanced data often occurs in real-life applications such as the analysis of medical datasets, where the number of patients with the disease is significantly lower than the number without it. For example, a classification model built to predict cancer on such data yields lower classification performance on the abnormal class, and incorrect disease predictions can lead to serious health risks.
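The bias towards the majority class can be illustrated with a small sketch. It scores a trivial classifier that always predicts "healthy" using the three measures named in the abstract (sensitivity, F1 measure, and balanced accuracy) alongside plain accuracy. The 5-versus-95 class split, the label encoding (1 for disease, 0 for healthy), and the function names `confusion_counts` and `scores` are made up for illustration and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the disease class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def scores(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, F1, and balanced accuracy."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical toy data: 5 disease cases (1) and 95 healthy cases (0).
y_true = [1] * 5 + [0] * 95
# A classifier that ignores the minority class and always predicts healthy.
y_pred = [0] * 100

result = scores(y_true, y_pred)
print(result)
# accuracy is 0.95, yet sensitivity and F1 are 0.0 and
# balanced accuracy is 0.5: no disease case is detected.
```

Plain accuracy rewards this classifier because it mirrors the class proportions, whereas sensitivity, F1, and balanced accuracy expose that the disease class is never recognised.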

