A Comparison of Different Estimation Methods to Handle Missing Data in Explanatory Variables

Manal Jabbar Salman

doi:10.24996/ijs.2020.61.12.20

Abstract

Missing data is a common problem in regression models. It is usually handled by the deletion mechanisms available in statistical software, but deletion reduces the sample size and therefore weakens statistical inference. In this paper, the Expectation Maximization (EM) algorithm, the Multicycle-Expectation-Conditional Maximization (MC-ECM) algorithm, the Expectation-Conditional Maximization Either (ECME) algorithm, and Recurrent Neural Networks (RNN) are used to estimate multiple regression models when the explanatory variables have missing values. Experimental datasets were generated using the Visual Basic programming language, with values of the explanatory variables missing at random in a general pattern, at missing ratios of 10%, 20%, and 30%, error variances of 0.5, 1.5, and 2, and sample sizes of 25, 50, 100, and 500. The methods were evaluated using the Mean Squared Error (MSE). Simulation results show that RNN outperforms the other methods, followed by EM at small sample sizes.
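The simulation design described in the abstract can be illustrated with a short sketch. It is written in Python with NumPy rather than the Visual Basic used in the paper, and it is not the author's program. Several details are assumptions made only for illustration: the true coefficient vector `beta`, the standard normal explanatory variables, the number of replications, masking each explanatory value independently (a simpler mechanism than the missing-at-random general pattern used in the paper), and measuring MSE on the coefficient estimates. Only the listwise-deletion baseline is fitted here, not EM, MC-ECM, ECME, or RNN.

```python
import numpy as np

def simulate_once(n, sigma2, miss_ratio, beta, rng):
    """One replication: generate data, mask X, fit by listwise deletion.

    Returns the squared error of the coefficient estimates, averaged
    over the coefficients (an assumed evaluation target).
    """
    p = len(beta) - 1
    X = rng.normal(size=(n, p))
    noise = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = beta[0] + X @ beta[1:] + noise

    # Assumption: each explanatory value is masked independently with
    # probability miss_ratio (simpler than the paper's mechanism).
    X_miss = X.copy()
    X_miss[rng.random(size=X.shape) < miss_ratio] = np.nan

    # Listwise deletion: keep only rows with no missing explanatory value.
    keep = ~np.isnan(X_miss).any(axis=1)
    A = np.column_stack([np.ones(keep.sum()), X_miss[keep]])
    beta_hat, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
    return float(np.mean((beta_hat - beta) ** 2))

def run_grid(reps=200, seed=0):
    """Average the squared error over replications for every design cell."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 2.0, -1.0])  # hypothetical true coefficients
    results = {}
    for n in (25, 50, 100, 500):              # sample sizes from the abstract
        for sigma2 in (0.5, 1.5, 2.0):        # error variances from the abstract
            for ratio in (0.10, 0.20, 0.30):  # missing ratios from the abstract
                errs = [simulate_once(n, sigma2, ratio, beta, rng)
                        for _ in range(reps)]
                results[(n, sigma2, ratio)] = float(np.mean(errs))
    return results
```

Calling `run_grid()` returns a dictionary keyed by `(n, sigma2, miss_ratio)`, one entry for each of the 36 design cells. Under these assumptions, it gives a deletion baseline against which an imputation-based estimator could be compared.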

Highlights

In applied statistics, the study of data with missing values is an important topic, because analyzing such data can give inaccurate and unreliable results, and most specialist programmers rely on deleting these values before analysis [1].
At a sample size of n = 100 and a missing ratio of 30%, the Multicycle-Expectation-Conditional Maximization (MC-ECM) algorithm performed best according to the simulation results. For the remaining sample sizes and missing ratios, Recurrent Neural Networks (RNN) recorded lower minimum mean squared error values than the Expectation Maximization (EM) algorithm, MC-ECM, and Expectation-Conditional Maximization Either (ECME).
Based on Tables 1-3 and Figure 5, using MMSE for the regression models is superior to the other algorithms when there are missing values in the explanatory variables, at large sample sizes (N = 100, 500) and for all error variances and missing ratios.

Summary

Introduction

In applied statistics, the study of data with missing values is an important topic, because analyzing such data can give inaccurate and unreliable results, and most specialist programmers rely on deleting these values before analysis [1]. Liu and Rubin (1995) used the EM algorithm and its extensions, ECM and ECME, to obtain more efficient maximum likelihood (ML) estimates, including in factor analysis models that may arise in the context of educational tests [3]. Harshanand (2013) applied neural networks (NN) to rainfall data with missing values and concluded that this method gives strong results that reflect the uncertainty caused by the missing values [5].
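To make the EM idea concrete, the following is a minimal sketch of the standard EM iteration for the mean and covariance of a multivariate normal sample with missing entries, followed by the regression coefficients implied by those moments. It is an illustration under stated assumptions, not the procedure used in the paper: joint normality of the explanatory variables and the response is assumed, the response is taken to be the last column and fully observed, and the function names `em_mvn` and `regression_from_moments` are hypothetical. The ECM and ECME variants are not shown.

```python
import numpy as np

def em_mvn(Z, n_iter=50, tol=1e-8):
    """EM estimates of the mean and covariance of a multivariate normal
    from a data matrix Z whose missing entries are coded as NaN."""
    n, d = Z.shape
    miss = np.isnan(Z)
    mu = np.nanmean(Z, axis=0)             # start from observed-data means
    sigma = np.diag(np.nanvar(Z, axis=0))  # and a diagonal covariance
    for _ in range(n_iter):
        sum_z = np.zeros(d)
        sum_zz = np.zeros((d, d))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            z = Z[i].copy()
            C = np.zeros((d, d))  # conditional covariance of missing part
            if m.any():
                # E-step: regress the missing block on the observed block.
                B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                z[m] = mu[m] + B @ (z[o] - mu[o])
                C[np.ix_(m, m)] = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
            sum_z += z
            sum_zz += np.outer(z, z) + C
        # M-step: update the moments from the expected sufficient statistics.
        mu_new = sum_z / n
        sigma_new = sum_zz / n - np.outer(mu_new, mu_new)
        done = (np.max(np.abs(mu_new - mu)) < tol
                and np.max(np.abs(sigma_new - sigma)) < tol)
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma

def regression_from_moments(mu, sigma):
    """Intercept and slopes of the last variable regressed on the others,
    computed from a joint mean vector and covariance matrix."""
    slopes = np.linalg.solve(sigma[:-1, :-1], sigma[:-1, -1])
    intercept = mu[-1] - mu[:-1] @ slopes
    return np.concatenate([[intercept], slopes])
```

A typical use would stack the explanatory variables and the response as `Z = np.column_stack([X_miss, y])` and then call `regression_from_moments(*em_mvn(Z))`. Unlike deletion, every row contributes its observed values to the estimates.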

Methods

Results

Conclusion