Abstract

The accuracy of the statistical learning model depends on the learning technique used which in turn depends on the dataset’s values. In most research studies, the existence of missing values (MVs) is a vital problem. In addition, any dataset with MVs cannot be used for further analysis or with any data driven tool especially when the percentage of MVs are high. In this paper, the authors propose a novel algorithm for dealing with MVs depending on the feature selection (FS) of similarity classifier with fuzzy entropy measure. The proposed algorithm imputes MVs in cumulative order. The candidate feature to be manipulated is selected using similarity classifier with Parkash’s fuzzy entropy measure. The predictive model to predict MVs within the candidate feature is the Bayesian Ridge Regression (BRR) technique. Furthermore, any imputed features will be incorporated within the BRR equation to impute the MVs in the next chosen incomplete feature. The proposed algorithm was compared against some practical state-of-the-art imputation methods by conducting an experiment on four medical datasets which were gathered from several databases repository with MVs generated from the three missingness mechanisms. The evaluation metrics of mean absolute error (MAE), root mean square error (RMSE) and coefficient of determination (R2 score) were used to measure the performance. The results exhibited that performance vary depending on the size of the dataset, amount of MVs and the missingness mechanism type. Moreover, compared to other methods, the results showed that the proposed method gives better accuracy and less error in most cases.

Highlights

  • missing values (MVs) are considered a critical problem that can occur in many scientific areas such as biological, psychological, or medical [1]

  • The results exhibit that the performance differs from one algorithm to another depending on the dimension of the dataset, the missingness mechanism type, and the amount of MVs in the dataset

  • MVs are considered a critical problem in pattern recognition, Machine learning (ML) and data mining applications

Read more

Summary

Introduction

MVs are considered a critical problem that can occur in many scientific areas such as biological, psychological, or medical [1]. The existence of MVs within a dataset can result in problems, for instance, bad data analysis, reducing the research results obtained from such dataset and presenting amount of bias [3]. To this end, significant information is incorporated within MVs which should be manipulated before using the incomplete dataset with any data driven tool. Several imputation algorithms may result in poor imputation and may fail in handling all MVs in the dataset. They may not deal with all missingness mechanisms.

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call