Missing Value Imputation Using Stratified Supervised Learning for Cardiovascular Data

Darryl Nd, Rahman Mm

doi:10.4172/2229-8711.s1113

Abstract

Legacy (and current) medical datasets are a rich source of information and knowledge. However, the use of most legacy medical datasets is beset with problems. One of the most frequently encountered is missing data, often due to oversights in data capture or data entry procedures. Algorithms commonly used in data analysis often depend on a complete dataset. Missing value imputation offers a solution to this problem. This may result in the generation of synthetic data, with artificially induced missing values, but simply removing the incomplete data records often produces the best classifier results. With legacy data, simply removing the records from the original datasets can significantly reduce the data volume and often affects the class balance of the dataset. A suitable method for missing value imputation is therefore much needed to produce good-quality datasets for better analysis of data resulting from clinical trials. This paper proposes a framework for missing value imputation using stratified machine learning methods. We explore machine learning techniques for predicting missing values in incomplete clinical (cardiovascular) data, with experiments comparing these with other standard methods. Two machine learning (classifier) algorithms, the fuzzy unordered rule induction algorithm and decision tree, plus other machine learning algorithms (for comparison purposes), are trained on complete data and subsequently used to predict missing values for incomplete data. The resulting complete datasets are classified using decision tree, neural network, k-NN and k-means clustering. Classification performance is evaluated using sensitivity, specificity, accuracy, positive predictive value and negative predictive value. The results show that final classifier performance can be significantly improved for all class labels when stratification is used with the fuzzy unordered rule induction algorithm to predict missing attribute values.
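The five evaluation measures named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code. It assumes a two-class setting with "high" as the positive label and a hypothetical "low" as the negative label; the function name `binary_metrics` is likewise illustrative.

```python
def binary_metrics(y_true, y_pred, positive="high"):
    """Compute sensitivity, specificity, accuracy, PPV and NPV.

    Assumption: binary labels, with `positive` marking the class of interest.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # true positive
            else:
                fp += 1  # false positive
        else:
            if truth == positive:
                fn += 1  # false negative
            else:
                tn += 1  # true negative

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),  # positive predictive value
        "npv": ratio(tn, tn + fn),  # negative predictive value
    }
```

Reporting all five measures together is useful when classes are imbalanced, since accuracy alone can hide poor performance on the minority class.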

Highlights

Legacy medical datasets are a rich source of information and knowledge, and there is a growing trend of research funders expecting the data resulting from clinical trials to be used beyond the originating study
Machine learning methods can be used for predicting missing values, for example by using a rule induction algorithm in which rules are induced from the original complete dataset, with missing attribute values ignored
The results are compared with some other non-stratified machine learning based missing value imputation methods using decision tree, SVM, K
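The idea of learning from complete records and then predicting missing attribute values can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: strata are assumed to be defined by the class label, a simple one-rule learner stands in for the fuzzy unordered rule induction algorithm, and the names `stratified_impute`, `fit_one_rule` and the `None` missing-value marker are hypothetical.

```python
from collections import Counter, defaultdict

MISSING = None  # assumed marker for a missing attribute value


def fit_one_rule(rows, target, predictor):
    """Learn 'predictor value -> most common target value' from complete rows."""
    votes = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        votes[row[predictor]][row[target]] += 1
        overall[row[target]] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    default = overall.most_common(1)[0][0]  # fallback for unseen predictor values
    return rules, default


def stratified_impute(records, target, predictor, class_attr):
    """Fill missing `target` values using a model trained per class stratum."""
    # 1. Group the complete records by class label (the strata).
    strata = defaultdict(list)
    for row in records:
        if row[target] is not MISSING:
            strata[row[class_attr]].append(row)

    # 2. Train one model per stratum on its complete records only.
    models = {
        label: fit_one_rule(rows, target, predictor)
        for label, rows in strata.items()
    }

    # 3. Predict each missing value with the model of the record's own stratum.
    imputed = []
    for row in records:
        row = dict(row)  # copy so the input records are left untouched
        if row[target] is MISSING and row[class_attr] in models:
            rules, default = models[row[class_attr]]
            row[target] = rules.get(row[predictor], default)
        imputed.append(row)
    return imputed
```

Training a separate model per stratum lets the predicted value depend on the record's class, which a single model trained on the pooled data would not guarantee.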

Summary

Introduction

Legacy medical datasets are a rich source of information and knowledge, and there is a growing trend of research funders expecting the data resulting from clinical trials to be used beyond the originating study. Machine learning methods can be used for predicting missing values, for example by using a rule induction algorithm in which rules are induced from the original complete dataset, with missing attribute values ignored. The results are compared with some other non-stratified machine learning based missing value imputation methods using decision tree, SVM, K-. The rules from stratified machine learning based missing value imputation are later applied to the incomplete data to predict the missing attribute values. The values of the class attribute are generated according to the following heuristic model [30]: an instance (cardiovascular patient) is classified as “high” if the patient’s death or a severe cardiovascular event (e.g. stroke, myocardial relapse or cardiovascular arrest) occurs within 30 days after an operation. k-NN is one of the simplest machine learning algorithms: an object is classified by a majority vote of its neighbours, being allocated to the class most common amongst its “k” nearest neighbours (k is a positive integer, typically small).
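The majority-vote rule behind k-NN can be illustrated with a short sketch. It assumes numeric feature vectors and Euclidean distance, neither of which is specified in the text above; the function name `knn_classify` and the default `k=3` are illustrative choices.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (feature_vector, label) pairs
    query: feature vector of the same length as the training vectors
    Assumption: Euclidean distance; ties go to the label seen first
    among the nearest neighbours.
    """
    # Sort training points by distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

With two classes, an odd k avoids tied votes. In practice, features are usually scaled to comparable ranges first so that no single attribute dominates the distance.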

Experiments

Findings

Conclusion