Abstract

In practice, many medical domain datasets are incomplete: a proportion of their records have missing attribute values. Missing value imputation can be performed to address this problem. To impute missing values, some of the observed data (i.e., complete data) are generally used as the reference or training set, and statistical or machine learning techniques are then employed to produce estimates that replace the missing values. Since a collected dataset usually contains a certain number of feature dimensions, it is useful to perform feature selection for better pattern recognition. The aim of this paper is therefore to examine the effect of performing feature selection on missing value imputation of medical datasets. Experiments are carried out on five different medical domain datasets containing various feature dimensions. In addition, three different types of feature selection methods and imputation techniques are employed for comparison. The results show that combining feature selection and imputation is a better choice for many medical datasets. However, the feature selection algorithm should be chosen carefully in order to produce the best result. In particular, the genetic algorithm and information gain models are suitable for lower dimensional datasets, whereas the decision tree model is a better choice for higher dimensional datasets.
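The imputation idea described in the abstract (complete records serve as the reference set, and a learned model estimates the missing entries) can be illustrated with a small nearest-neighbor sketch. This is not the authors' implementation: the function name `knn_impute`, the `selected` parameter (a stand-in for the output of any feature selection step), and the toy data are assumptions made for illustration only.

```python
from math import sqrt


def knn_impute(rows, k=3, selected=None):
    """Fill None entries with the mean of the k nearest complete rows.

    rows:     list of equal-length lists of floats; None marks a missing value.
    selected: optional feature indices used for the distance computation
              (hypothetical hook for a prior feature selection step).
    """
    n_features = len(rows[0])
    use = set(range(n_features)) if selected is None else set(selected)

    # Complete rows act as the reference (training) set.
    reference = [r for r in rows if None not in r]
    if not reference:
        raise ValueError("need at least one complete row")

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue

        # Distance uses only features that are observed in this row
        # and that survived feature selection.
        observed = [j for j in range(n_features)
                    if row[j] is not None and j in use]

        def dist(ref, row=row, observed=observed):
            return sqrt(sum((row[j] - ref[j]) ** 2 for j in observed))

        neighbours = sorted(reference, key=dist)[:k]
        filled = [
            v if v is not None
            else sum(nb[j] for nb in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed


# Toy data: the last record is missing its third attribute.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [5.0, 6.0, 50.0],
    [5.2, 6.1, 52.0],
    [1.05, 2.05, None],
]
result = knn_impute(data, k=2)
print(result[-1])
```

In this toy example the two nearest complete records are the first two rows, so the missing attribute is replaced by the mean of their third values. Passing `selected=[0]` restricts the distance to the first feature only, which mimics running imputation after irrelevant features have been filtered out.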

Highlights

  • In many real-world medical domain problems, the datasets collected for data mining purposes are usually incomplete, containing missing values or missing data, such as pulmonary embolism data [1], DNA microarray data [2], metabolomics data [3], cardiovascular disease data [4], lung disease data [5], food composition data [6], traffic data [7], and other medical data [8]. Many data mining and machine learning algorithms used in the data mining process are not able to effectively analyze incomplete datasets.

  • The results are interesting, showing that for the arrhythmia dataset, which contains the largest number of features, performing feature selection by decision tree (DT) can allow the multilayer perceptron (MLP), k-nearest neighbor (KNN), and support vector machine (SVM)

  • The results show that DT is a better choice for the higher dimensional datasets as large numbers of features can be filtered out, while combining DT with the imputation models can provide the best result in the arrhythmia dataset and reasonably good performance in the breast cancer dataset

Summary

Introduction

In many real-world medical domain problems, the datasets collected for data mining purposes are usually incomplete, containing missing (attribute) values or missing data, such as pulmonary embolism data [1], DNA microarray data [2], metabolomics data [3], cardiovascular disease data [4], lung disease data [5], food composition data [6], traffic data [7], and other medical data [8]. Many data mining and machine learning algorithms used in the data mining process are not able to effectively analyze incomplete datasets. Directly using incomplete datasets for the purpose of data analysis can have a significant effect on the final conclusions that are drawn from the data [9].

