Effective diagnosis of heart disease imposed by incomplete data based on fuzzy random forest

Elzhan Zeinulla, Adnan Yazici, Karina Bekbayeva

doi:10.1109/fuzz48607.2020.9177531

Abstract

This study presents data preprocessing and imputation techniques for building a model from medical sensor data. We aim to develop a framework for diagnosing heart disease from incomplete and dirty data, a common situation with medical data. Medical datasets are often small and imbalanced, and they contain many missing, false, or inaccurate values. In this study, we combine the synthetic minority oversampling technique (SMOTE) with Tomek links to increase the size of the dataset and eliminate its imbalance. We performed a number of experiments and measurements on the Cleveland dataset and compared various prediction models with recent algorithms in the literature. To process additional data from Budapest, Zurich and Basel, we apply semi-supervised pseudo-labelling, in which the values of unlabelled records are predicted to make them pseudo-labelled, so that the model can be trained on them together with the labelled data. The same algorithm used for the Cleveland dataset was then applied to the entire dataset. A Fuzzy Random Forest was implemented as the main classifier. The final accuracy of the proposed approach is 93.4%, with specificity and sensitivity of 96.92% and 89.99%, respectively, which is superior to previous models in the literature.
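To illustrate the resampling step mentioned in the abstract, the following is a minimal sketch of SMOTE followed by Tomek-link removal, written in plain Python. It is not the authors' implementation: the function names (`tomek_links`, `smote`, `smote_tomek`), the neighbour count `k`, the Euclidean distance, the toy data, and the choice to drop both members of every Tomek link are all assumptions made for illustration.

```python
import math
import random

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j) of opposite-class points that are mutual nearest neighbours."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def smote(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        neighbours = sorted((j for j in range(len(X_min)) if j != i),
                            key=lambda j: math.dist(X_min[i], X_min[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from X_min[i] to X_min[j]
        synthetic.append([a + gap * (b - a)
                          for a, b in zip(X_min[i], X_min[j])])
    return synthetic

def smote_tomek(X, y, minority=1, k=2, seed=0):
    """Oversample the minority class to parity, then drop all Tomek links.
    Assumption: both points of each link are removed."""
    X_min = [x for x, label in zip(X, y) if label == minority]
    n_new = max(0, (len(X) - len(X_min)) - len(X_min))
    X_res = list(X) + smote(X_min, n_new, k, seed)
    y_res = list(y) + [minority] * n_new
    drop = {idx for pair in tomek_links(X_res, y_res) for idx in pair}
    keep = [i for i in range(len(X_res)) if i not in drop]
    return [X_res[i] for i in keep], [y_res[i] for i in keep]

# Toy two-feature data (hypothetical, not taken from the Cleveland dataset).
X_toy = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.4, 0.2], [0.3, 0.4],
         [1.0, 1.0], [1.2, 0.9]]
y_toy = [0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = smote_tomek(X_toy, y_toy)
```

Because each Tomek link pairs one point from each class, removing both members keeps the classes balanced after oversampling. Removing only the majority-class member is a common alternative.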

Full Text