Imputation Methods for Longitudinal Data: A Comparative Study

Ahmed Mahmoud Gad

doi:10.11648/j.ijsd.20170304.13

Abstract

Longitudinal studies play an important role in scientific research. Their defining characteristic is that observations are collected from each subject repeatedly, either over time or under different conditions. Missing values are common in longitudinal studies, and their presence is a fundamental challenge because it can produce bias even in well-controlled conditions. Three missing data mechanisms are defined: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Several imputation methods have been developed in the literature to handle missing values in longitudinal data. The most commonly used methods include complete case analysis (CCA), mean imputation (Mean), last observation carried forward (LOCF), hot deck (HOT), regression imputation (Regress), K-nearest neighbor (KNN), the expectation maximization (EM) algorithm, and multiple imputation (MI). In this article, a comparative study is conducted to investigate the efficiency of these eight imputation methods under the different missing data mechanisms. The comparison is carried out through a simulation study. The results indicate that MI is the most effective method, as it yields the smallest standard errors, while the EM algorithm has the largest relative bias. The methods are also compared using a real data application.

Highlights

Longitudinal studies have become increasingly common, especially in public health and the medical sciences
The complete case analysis (CCA) method should be considered the first choice of imputation even under the missing completely at random (MCAR) mechanism
The performance of CCA degraded under the missing at random (MAR) and missing not at random (MNAR) settings

Summary

Introduction

Longitudinal studies have become increasingly common, especially in public health and the medical sciences. Such studies are designed to investigate changes in a specific variable, which is measured repeatedly either at different times or under different conditions. Missing values are common in longitudinal studies because some individuals may miss a planned visit. Many causes can lead to missing values, including measurement failure, accidents, errors in collecting or entering data, refusal to continue, and other administrative reasons. Whenever values are missing, information is lost, which reduces efficiency. Under certain circumstances, missing data can also introduce bias and thereby lead to misleading inferences about the parameters.
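The point about bias can be illustrated with a small toy simulation. The sketch below is not the simulation design used in this study; the two-visit data model, the missingness probabilities, and all function names are assumptions chosen only for illustration. It generates a baseline and a follow-up measurement, deletes follow-up values under an MCAR rule and under a MAR rule (where missingness depends only on the observed baseline), and then estimates the follow-up mean by complete case analysis and by LOCF.

```python
import random
import statistics


def simulate(n, seed=0):
    """Toy two-visit data (an assumption): y2 = y1 + 1 + noise, so E[y2] = 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y1 = rng.gauss(0.0, 1.0)
        y2 = y1 + 1.0 + rng.gauss(0.0, 0.5)
        data.append((y1, y2))
    return data


def drop_mcar(data, rate, seed=1):
    """MCAR: each follow-up value is deleted with the same probability."""
    rng = random.Random(seed)
    return [(y1, None if rng.random() < rate else y2) for y1, y2 in data]


def drop_mar(data, seed=2):
    """MAR: deletion probability depends only on the observed baseline y1."""
    rng = random.Random(seed)
    out = []
    for y1, y2 in data:
        p_missing = 0.8 if y1 > 0 else 0.1  # illustrative probabilities
        out.append((y1, None if rng.random() < p_missing else y2))
    return out


def complete_case_mean(data):
    """CCA estimate of the follow-up mean: use only subjects with y2 observed."""
    return statistics.fmean(y2 for _, y2 in data if y2 is not None)


def locf_mean(data):
    """LOCF estimate: replace a missing y2 with the subject's baseline y1."""
    return statistics.fmean(y1 if y2 is None else y2 for y1, y2 in data)


if __name__ == "__main__":
    full = simulate(20000)
    mcar = drop_mcar(full, rate=0.4)
    mar = drop_mar(full)
    print("true mean of y2 (by construction): 1.0")
    print("CCA under MCAR :", round(complete_case_mean(mcar), 3))
    print("CCA under MAR  :", round(complete_case_mean(mar), 3))
    print("LOCF under MCAR:", round(locf_mean(mcar), 3))
```

In this toy setup the complete-case estimate should land close to the true value under MCAR, at the cost of discarding data, and clearly below it under MAR, because subjects with a high baseline are under-represented among the complete cases. LOCF is biased here even under MCAR, since carrying the baseline forward ignores the upward shift between visits.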

Objectives

Methods

Conclusion