The Feature Selection Effect on Missing Value Imputation of Medical Datasets

Chia-Hui Liu, Min-Wei Huang, Kuen-Liang Sue, Chih-Fong Tsai

doi:10.3390/app10072344

Abstract

In practice, many medical domain datasets are incomplete: a proportion of their records have missing attribute values. Missing value imputation can be performed to address this problem. To impute missing values, some of the observed data (i.e., complete data) are generally used as the reference or training set, and statistical and machine learning techniques are then employed to produce estimates that replace the missing values. Since a collected dataset usually contains a certain number of feature dimensions, it is useful to perform feature selection for better pattern recognition. The aim of this paper is therefore to examine the effect of performing feature selection on missing value imputation of medical datasets. Experiments are carried out on five different medical domain datasets containing various feature dimensions. In addition, three different types of feature selection methods and imputation techniques are employed for comparison. The results show that combining feature selection and imputation is a better choice for many medical datasets. However, the feature selection algorithm should be chosen carefully in order to produce the best result. In particular, the genetic algorithm and information gain models are suitable for lower dimensional datasets, whereas the decision tree model is a better choice for higher dimensional datasets.
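The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes information gain computed on a median split of each feature, a plain k-nearest-neighbor mean as the imputation model, and a made-up toy dataset. The function names (`select_features`, `knn_impute`) and the choice to select features before imputing are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one numeric feature, binarised at its median.

    The median split is an assumption made for this sketch.
    """
    threshold = sorted(values)[len(values) // 2]
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    gain = entropy(labels)
    for part in (left, right):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def select_features(rows, labels, k):
    """Rank features by information gain on the complete rows; keep the top k."""
    complete = [(r, y) for r, y in zip(rows, labels) if None not in r]
    xs = [r for r, _ in complete]
    ys = [y for _, y in complete]
    gains = [information_gain([r[j] for r in xs], ys) for j in range(len(rows[0]))]
    ranked = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:k])


def knn_impute(rows, n_neighbors=2):
    """Replace each None with the feature mean over the nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(ref):
            # Euclidean distance over the features observed in this row only.
            return math.sqrt(sum((row[j] - ref[j]) ** 2 for j in observed))

        nearest = sorted(complete, key=distance)[:n_neighbors]
        imputed.append([
            v if v is not None else sum(ref[j] for ref in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed


# Toy data (hypothetical): features 0 and 1 track the class, feature 2 is noise.
rows = [
    [1.0, 10.0, 5.0], [1.2, 11.0, 1.0], [0.9, 9.5, 4.0],
    [3.0, 30.0, 2.0], [3.2, 31.0, 5.5], [2.9, 29.0, 1.5],
    [1.1, None, 3.0], [3.1, None, 2.5],  # incomplete records
]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

selected = select_features(rows, labels, k=2)
reduced = [[r[j] for j in selected] for r in rows]
imputed = knn_impute(reduced, n_neighbors=2)
print(selected, imputed[6], imputed[7])
```

In this toy case the noisy third feature is dropped before imputation, so the neighbors used to fill each missing value are chosen only on the retained features. Whether that helps on real data depends on the dataset and the selection method, which is the question the paper examines.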

Highlights

In many real-world medical domain problems, the datasets collected for data mining purposes are usually incomplete, containing missing values or missing data, such as pulmonary embolism data [1], DNA microarray data [2], metabolomics data [3], cardiovascular disease data [4], lung disease data [5], food composition data [6], traffic data [7], and other medical data [8]. Many data mining and machine learning algorithms used in the data mining process are not able to effectively analyze incomplete datasets.
The results are interesting, showing that for the arrhythmia dataset, which contains the largest number of features, performing feature selection by decision tree (DT) can allow the multilayer perceptron (MLP), k-nearest neighbor (KNN), and support vector machine (SVM)
The results show that DT is a better choice for the higher dimensional datasets, as large numbers of features can be filtered out, while combining DT with the imputation models can provide the best result in the arrhythmia dataset and reasonably good performance in the breast cancer dataset.

Summary

Introduction

In many real-world medical domain problems, the datasets collected for data mining purposes are usually incomplete, containing missing (attribute) values or missing data, such as pulmonary embolism data [1], DNA microarray data [2], metabolomics data [3], cardiovascular disease data [4], lung disease data [5], food composition data [6], traffic data [7], and other medical data [8]. Many data mining and machine learning algorithms used in the data mining process are not able to effectively analyze incomplete datasets. Directly using incomplete datasets for the purpose of data analysis can have a significant effect on the final conclusions that are drawn from the data [9].



Journal: Applied Sciences
Publication Date: Mar 29, 2020
Citations: 23
License type: CC BY 4.0


