Impact of Missing Data on Correlation Coefficient Values: Deletion and Imputation Methods for Data Preparation

Mohamed Shantal, Zalinda Othman, Azuraliza Abu Bakar

doi:10.11113/mjfas.v19n6.3098

Abstract

The correlation coefficient is one of the essential statistical techniques used to discover relationships among variables. Depending on the data type, correlation can be quantified with Pearson's, Spearman's, or Kendall's correlation coefficient. As with any use of data, missing data reduce the amount of data available and can affect the results. Furthermore, removing records with missing values, as in complete case analysis or available case analysis, may introduce selection bias. In this paper, we investigate the impact of missing data on the correlation coefficient by calculating the difference between the correlation coefficients of the original complete dataset and those of the same dataset with missing data. Two deletion strategies (Listwise and Pairwise) and three imputation strategies (Mean, k-Nearest Neighbors (k-NN), and Expectation-Maximization) were used to prepare the data before calculating the correlation coefficient. The unique correlation coefficient values were collected into a one-dimensional array, and the root mean square error (RMSE) was used to evaluate the experiments. Eight UCI and Kaggle datasets with different sizes and numbers of attributes were used in this study. The experimental results demonstrate that Pairwise deletion and k-NN imputation give good results for the correlation coefficient when the missing rate is moderate or lower. Pairwise uses all the available values and discards only the missing values of the related attribute, while k-NN fills the missing values with new values that produce correlation coefficient values close to the actual values.
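The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the synthetic three-column dataset, the 10% missing rate, the completely-at-random missingness, and the helper names `unique_corr` and `rmse` are all assumptions made for illustration. It compares Listwise deletion, Pairwise deletion, and Mean imputation by the RMSE between the unique (upper-triangle) Pearson coefficients of the complete data and those of the prepared data.

```python
import numpy as np
import pandas as pd


def unique_corr(df):
    """Return the unique (upper-triangle) Pearson correlations as a 1-D array."""
    corr = df.corr(method="pearson").to_numpy()
    rows, cols = np.triu_indices_from(corr, k=1)  # skip diagonal and duplicates
    return corr[rows, cols]


def rmse(a, b):
    """Root mean square error between two equally sized arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic complete dataset (assumption: stands in for a real dataset).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
full = pd.DataFrame({
    "a": x,
    "b": 0.8 * x + 0.6 * rng.normal(size=n),
    "c": -0.5 * x + rng.normal(size=n),
})

# Assumption: about 10% of cells are removed completely at random.
mask = rng.random(full.shape) < 0.10
missing = full.mask(mask)

reference = unique_corr(full)
results = {
    # Listwise: drop every row that contains at least one missing value.
    "listwise": rmse(reference, unique_corr(missing.dropna())),
    # Pairwise: DataFrame.corr skips missing values per pair of columns.
    "pairwise": rmse(reference, unique_corr(missing)),
    # Mean imputation: fill each column's gaps with that column's mean.
    "mean": rmse(reference, unique_corr(missing.fillna(missing.mean()))),
}

for name, err in results.items():
    print(f"{name}: RMSE = {err:.4f}")
```

A lower RMSE means the strategy preserved the correlation structure of the complete data more closely. k-NN and Expectation-Maximization imputation could be added to `results` in the same way, given an imputer that returns a completed DataFrame.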

Full Text