Abstract

Many real-world medical datasets contain some proportion of missing (attribute) values. Missing value imputation is generally used to address this problem: the missing values are estimated by a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimates of the missing values may be unreliable or may even differ considerably from the real values. The aim of this paper is to examine whether combining instance selection from the observed data with missing value imputation offers better performance than missing value imputation alone. In particular, three instance selection algorithms (DROP3, GA, and IB3) and three imputation algorithms (KNNI, MLP, and SVM) are used in order to find the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation for numerical medical datasets, and that specific combinations of instance selection and imputation methods can improve the imputation results for mixed-type medical datasets. However, instance selection does not have a clearly positive impact on the imputation results for categorical medical datasets.
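The pipeline described in the abstract (filter the observed data first, then impute from what remains) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `select_instances` is a simple nearest-neighbour noise filter used as a stand-in for DROP3, GA, or IB3, `knn_impute` is a basic take on k-nearest neighbor imputation, and all function names and data values are hypothetical.

```python
import math

def distance(a, b, cols):
    """Euclidean distance restricted to the given column indices."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols))

def select_instances(rows, labels, k=3):
    """Stand-in noise filter (NOT DROP3, GA, or IB3).

    Keeps the index of a row only if a strict majority of its k nearest
    neighbours carry the same class label as the row itself.
    """
    cols = range(len(rows[0]))
    kept = []
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        others.sort(key=lambda j: distance(row, rows[j], cols))
        votes = [labels[j] for j in others[:k]]
        if votes.count(labels[i]) * 2 > len(votes):
            kept.append(i)
    return kept

def knn_impute(incomplete, reference, k=3):
    """Fill None entries with the mean over the k nearest reference rows.

    Nearness is measured only on the attributes observed in `incomplete`.
    """
    observed = [c for c, v in enumerate(incomplete) if v is not None]
    nearest = sorted(reference, key=lambda r: distance(incomplete, r, observed))[:k]
    return [
        v if v is not None else sum(r[c] for r in nearest) / len(nearest)
        for c, v in enumerate(incomplete)
    ]

# Hypothetical complete (observed) data: two tight clusters plus one
# mislabeled outlier in the last row.
complete = [
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [5.0, 5.1],
    [1.0, 9.0],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]

record = [1.0, None]  # second attribute is missing

# Baseline: impute from all observed rows, outlier included.
baseline = knn_impute(record, complete)

# Combined approach: filter the observed rows first, then impute.
kept = select_instances(complete, labels)
filtered = [complete[i] for i in kept]
combined = knn_impute(record, filtered)

print("imputation alone:", baseline)
print("selection + imputation:", combined)
```

In this toy data the outlier shares the record's observed attribute value, so it is drawn into the baseline neighbourhood and inflates the estimate; after filtering, the estimate comes only from the nearby cluster. Real instance selection methods use more elaborate criteria, so the effect on actual datasets may differ.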

Highlights

  • The first step in the data mining or knowledge discovery in databases (KDD) process is to collect a certain amount of data for a specific defined problem

  • Although the differences in performance between most of the combinations are very small (below 2% in classification accuracy), the best combination is genetic algorithms (GA) + multilayer perceptron (MLP) for the 10% missing rate and IB3 + k-nearest neighbor imputation (KNNI) for the 20% to 50% missing rates, which significantly outperform the other combinations and the baseline imputation methods (p < 0.01)

  • These results demonstrate that performing instance selection has a positive impact on missing value imputation over most numerical datasets


Summary

Introduction

The first step in the data mining or knowledge discovery in databases (KDD) process is to collect a certain amount of data for a specific defined problem. In practice, the medical dataset collected for later data mining steps is usually incomplete because of problems such as manual data entry procedures, incorrect measurements, and equipment errors. The collected datasets generally contain some missing (attribute) values or missing data [9, 21]. Many data mining algorithms cannot develop learning models from incomplete medical datasets. Although some algorithms, such as decision trees, can handle incomplete datasets without any preprocessing support [24], the final analysis or mining results can still be greatly affected by the incompleteness. The prediction performance of a model trained on an incomplete dataset is therefore questionable.
