Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

Abstract

Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore, a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
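
As an illustration of the three missingness mechanisms named in the Methods, the following Python sketch imposes about 25% missing values on one skewed covariate. It is not the authors' simulation code (the study used R): the variable names `x1` and `x2`, the logistic form of the missingness model, the slope of 1.5, the sample size and the seed are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2010)  # arbitrary seed, for reproducibility only
n = 5000

# Hypothetical covariates: x1 stays fully observed, x2 is right-skewed and
# correlated with x1; missing values will be imposed on x2.
x1 = rng.normal(size=n)
x2 = np.exp(0.5 * x1 + rng.normal(scale=0.5, size=n))


def missing_probability(driver, target_rate, slope):
    """Logistic missingness probabilities that average to target_rate.

    With slope > 0 the probability rises with the standardised driver;
    with slope = 0 every case gets the same probability.
    The intercept is found by bisection.
    """
    z = (driver - driver.mean()) / driver.std()
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        p = 1.0 / (1.0 + np.exp(-(mid + slope * z)))
        if p.mean() < target_rate:
            lo = mid
        else:
            hi = mid
    return 1.0 / (1.0 + np.exp(-((lo + hi) / 2 + slope * z)))


def impose_missing(values, prob, rng):
    """Return a copy of values with entries set to NaN with probability prob."""
    out = values.astype(float).copy()
    out[rng.random(values.size) < prob] = np.nan
    return out


rate = 0.25
# MCAR: missingness ignores all data (slope 0).
x2_mcar = impose_missing(x2, missing_probability(x1, rate, slope=0.0), rng)
# MAR: missingness depends only on the fully observed x1.
x2_mar = impose_missing(x2, missing_probability(x1, rate, slope=1.5), rng)
# MNAR: missingness depends on the value of x2 itself.
x2_mnar = impose_missing(x2, missing_probability(np.log(x2), rate, slope=1.5), rng)

print("full-data mean of x2:", round(float(x2.mean()), 3))
for name, x in [("MCAR", x2_mcar), ("MAR", x2_mar), ("MNAR", x2_mnar)]:
    print(name,
          "missing rate:", round(float(np.isnan(x).mean()), 3),
          "complete-case mean:", round(float(np.nanmean(x)), 3))
```

In this sketch the MAR and MNAR datasets look alike from the observed values of `x2` alone; the difference is that under MAR the fully observed `x1` carries the information needed to correct the distortion, whereas under MNAR no observed variable does.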

Similar Papers
  • Research Article
  • Citations: 13
  • 10.1161/strokeaha.115.007984
What is missing from my missing data plan?
  • May 7, 2015
  • Stroke
  • Sharon D Yeatts + 1 more

Under the intention-to-treat principle, all randomized subjects should be analyzed according to their randomly assigned treatment, regardless of treatment actually received or protocol compliance. Adherence to this principle requires that even subjects with missing outcome data be included in the analysis; in fact, the exclusion of such subjects can have important implications on power and bias. Statistical methods for dealing with missing data exist, but many questions remain unclear. Much statistical research has been devoted to the development and assessment of various methods for handling missing data.1 The choice of appropriate methodology requires assumptions on the mechanism underlying the missing data. All of these decisions should be made a priori, preferably before the trial starts but certainly before unblinding the trial. Related conversations between clinical investigators and the study statistician during the design phase often focus on more practical questions. Is there some threshold for the missing data rate below which the trial’s conclusions are unlikely to be affected? Under what circumstances can the missing data be excluded from the analysis without biasing estimation, or is imputation always the preferred approach? In this article, we discuss implications of missing outcome data from a practical standpoint. We describe potential reasons for missing data and suggest strategies to minimize its occurrence. We also present common imputation approaches and emphasize that because none of these approaches are universally preferred, the best analytic plan includes a series of sensitivity analyses. In any longitudinal trial where subjects are followed over some extensive period of time, lengthy follow-up makes missing data somewhat unavoidable. In stroke clinical trials, the primary outcome assessment often occurs at 90 days although there is evidence to suggest that additional follow-up may be beneficial. 
Subjects may expire, or withdraw informed consent, before primary outcome ascertainment. Subjects may become lost to the …

  • Research Article
  • 10.3390/jcm14113829
Missing Data in Orthopaedic Clinical Outcomes Research: A Sensitivity Analysis of Imputation Techniques Utilizing a Large Multicenter Total Shoulder Arthroplasty Database.
  • May 29, 2025
  • Journal of clinical medicine
  • Kevin A Hao + 9 more

Background: When missing data are present in clinical outcomes studies, complete-case analysis (CCA) is often performed, whereby patients with missing data are excluded. While simple, CCA may impart selection bias and reduce statistical power, leading to erroneous statistical results in some cases. However, there exist more rigorous statistical approaches, such as single and multiple imputation, which approximate the associations that would have been present in a full dataset and preserve the study's power. The purpose of this study is to evaluate how statistical results differ when performed after CCA versus imputation methods. Methods: This simulation study analyzed a sample dataset consisting of 2204 shoulders, with complete datapoints from a larger multicenter total shoulder arthroplasty database. From the sampled dataset of demographics, surgical characteristics, and clinical outcomes, we created five test datasets, ranging from 100 to 2000 shoulders, and simulated 10-50% missingness in the postoperative American Shoulder and Elbow Surgeons (ASES) score and range of motion in four planes in missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) patterns. Missingness in outcomes was remedied using CCA, three single imputation techniques, and two multiple imputation techniques. The imputation performance was evaluated relative to the native complete dataset using the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). We also compared the mean and standard deviation (SD) of the postoperative ASES score and the results of multivariable linear and logistic regression to understand the effects of imputation on the study results. 
Results: The average overall RMSE and MAPE were similar for MCAR (22.6 and 27.2%) and MAR (19.2 and 17.7%) missingness patterns, but were substantially poorer for NMAR (37.5 and 79.2%); the sample size and the percentage of data missingness minimally affected RMSE and MAPE. Aggregated mean postoperative ASES scores were within 5% of the true value when missing data were remedied with CCA, and all candidate imputation methods for nearly all ranges of sample size and data missingness when data were MCAR or MAR, but not when data were NMAR. When data were MAR, CCA resulted in overestimates of the SD. When data were MCAR or MAR, the accuracy of the regression estimate (β or OR) and its corresponding 95% CI varied substantially based on the sample size and proportion of missing data for multivariable linear regression, but not logistic regression. When data were MAR, the width of the 95% CI was up to 300% larger when CCA was used, whereas most imputation methods maintained the width of the 95% CI within 50% of the true value. Single imputation with k-nearest neighbor (kNN) method and multiple imputation with predictive mean matching (MICE-PMM) best-reproduced point estimates and intervariable relationships resembling the native dataset. Availability of correlated outcome scores improved the RMSE, MAPE, accuracy of the mean postoperative ASES score, and multivariable linear regression model estimates. Conclusions: Complete-case analysis can introduce selection bias when data are MAR, and it results in loss of statistical power, resulting in loss of precision (i.e., expansion of the 95% CI) and predisposition to false-negative findings. Our data demonstrate that imputation can reliably reproduce missing clinical data and generate accurate population estimates that closely resemble results derived from native primary shoulder arthroplasty datasets (i.e., prior to simulated data missingness). 
Further study of the use of imputation in clinical database research is critical, as the use of CCA may lead to different conclusions in comparison to more rigorous imputation approaches.
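
To make the evaluation described above concrete, the sketch below imputes a simulated outcome by a simplified form of predictive mean matching and then scores the imputed values with RMSE and MAPE. Everything in it is an assumption for illustration: the data are synthetic, the predictor `x` and outcome `y_true` are hypothetical, the donor pool size `k = 5` is arbitrary, and the imputation is a single pass that omits the random parameter draw a full multiple imputation procedure would add.

```python
import numpy as np

rng = np.random.default_rng(14)  # arbitrary seed
n = 500

# Synthetic data: a fully observed predictor and an outcome score.
x = rng.normal(50.0, 10.0, size=n)
y_true = 20.0 + 1.0 * x + rng.normal(0.0, 5.0, size=n)

# Impose roughly 30% MCAR missingness on the outcome.
missing = rng.random(n) < 0.30
observed = ~missing

# Step 1: fit a linear regression on the observed cases only.
design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design[observed], y_true[observed], rcond=None)
predicted = design @ beta

# Step 2: for each incomplete case, find the k observed cases whose
# predicted means are closest, and copy the observed value of one
# randomly chosen donor.
k = 5
obs_idx = np.flatnonzero(observed)
y_imputed = np.where(missing, np.nan, y_true)
for i in np.flatnonzero(missing):
    distance = np.abs(predicted[obs_idx] - predicted[i])
    donors = obs_idx[np.argsort(distance)[:k]]
    y_imputed[i] = y_true[rng.choice(donors)]

# Step 3: score the imputed cells against the values that were removed.
error = y_imputed[missing] - y_true[missing]
rmse = np.sqrt(np.mean(error ** 2))
mape = 100.0 * np.mean(np.abs(error / y_true[missing]))
print("RMSE:", round(float(rmse), 2), "MAPE (%):", round(float(mape), 2))
```

Because every imputed value is copied from an observed case, predictive mean matching cannot produce values outside the observed range, which is one reason it is often suggested for skewed or bounded scores.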

  • Research Article
  • 10.1111/jebm.70058
A Systematic Survey of the Optimal Strategy for Dealing With Missing Binary Outcomes in Simulation Studies of Randomized Controlled Trials.
  • Sep 1, 2025
  • Journal of evidence-based medicine
  • Yanjiao Shen + 25 more

To summarize the optimal strategies for dealing with missing binary outcome data (MBOD) in randomized controlled trials (RCTs) as informed by simulation studies, and to summarize the quality of reporting in these studies. To identify simulation studies comparing at least two strategies to deal with MBOD and evaluating their performance (bias, coverage and power), we searched MEDLINE, EMBASE, Cochrane Central Register of Controlled Trials via Ovid, Web of Science, and JSTOR from their inception up to December 20, 2023. We evaluated reporting quality using established criteria for simulation studies in medical statistics. We summarized data using descriptive statistics and a narrative synthesis. Our search identified 29,460 citations, of which five proved eligible. Multiple imputation (MI), investigated in five studies, showed consistently good performance in all domains tested for missing completely at random (MCAR) and missing at random (MAR) but with important limitations in missing not at random (MNAR). Complete case analysis (CCA), investigated in four studies of which three addressed model-based CCA, performed well in bias and coverage under MAR and MCAR, but less well for MNAR. One study reported that non-model-based CCA performed poorly with respect to bias under MAR. Non-model-based single imputation, investigated in two studies, showed consistently poor performance across all domains tested for MAR, MCAR and MNAR. One study reported that model-based single imputation performed well with respect to bias under MAR. Regarding reporting quality, all studies reported the aims, dependence of simulated data sets, scenarios and statistical methods evaluated, number of simulations performed, justification of data generation and criteria used to evaluate the simulation performance. None of the studies reported the starting seeds, random number generators and failures occurring during simulation. Simulation studies addressing methods to deal with MBOD in RCTs provided evidence that the MI approach is superior with respect to bias and coverage compared with CCA. Non-model-based single imputation generally performed poorly.

  • Research Article
  • Citations: 128
  • 10.1186/1471-2288-10-112
Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study.
  • Dec 1, 2010
  • BMC Medical Research Methodology
  • Andrea Marshall + 2 more

Background: The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model. Methods: Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied: a) complete case analysis (CC), b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM) and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained. Results: CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness. Conclusions: Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.
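
The contrast between the underestimated SE after SI and the behaviour of MI can be illustrated with the standard pooling step for multiply imputed analyses. The abstract does not state how estimates were combined; Rubin's rules are assumed here as the usual choice. The function name `pool_rubin` and the five coefficient estimates and variances are hypothetical numbers invented for this sketch, not results from the study.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and squared SEs with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical log hazard ratios for one covariate from m = 5 imputed datasets,
# each with the same squared standard error.
log_hazard_ratios = [0.50, 0.56, 0.47, 0.53, 0.44]
squared_ses = [0.010, 0.010, 0.010, 0.010, 0.010]

pooled_estimate, pooled_se = pool_rubin(log_hazard_ratios, squared_ses)
single_se = math.sqrt(squared_ses[0])  # what a single imputed dataset reports

print("pooled estimate:", round(pooled_estimate, 4))
print("pooled SE:", round(pooled_se, 4), "single-imputation SE:", round(single_se, 4))
```

A single imputed dataset reports only the within-imputation variance, so its SE ignores the uncertainty about the imputed values themselves; the between-imputation term is what widens the pooled interval and restores coverage.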

  • Research Article
  • 10.1080/00949655.2025.2558859
Multiple imputation under missing not at random: incorporating response indicators into sequential imputation
  • Oct 7, 2025
  • Journal of Statistical Computation and Simulation
  • Micha Fischer + 2 more

Multiple imputation (MI) of missing values is mostly applied under the assumption of missing at random (MAR), but the alternative missing not at random (MNAR) assumption may be more plausible. MI approaches that include response indicators (RIs) for incomplete covariates in predictions of missing values assume a form of MNAR. This paper investigates MI under MNAR assumptions using RIs as covariates. We review literature on imputation under MNAR and prediction with incomplete covariates. For the case of two incomplete variables, we describe the MNAR assumptions implied by including RIs in the imputation model, for normal and categorical data. We then compare the performance of different MI strategies in a simulation study, focussed on the property of inferences for regression coefficients and predictions of missing values. We find that for data generated under MAR, methods including RIs perform as well as those without them. In MNAR data scenarios, methods including RIs can improve performance for both analytic and descriptive inference.

  • Research Article
  • Citations: 39
  • 10.1177/0272989x13492203
Multiple Imputation Methods for Handling Missing Data in Cost-effectiveness Analyses That Use Data from Hierarchical Studies
  • Aug 1, 2013
  • Medical Decision Making
  • Manuel Gomes + 3 more

Multiple imputation (MI) has been proposed for handling missing data in cost-effectiveness analyses (CEAs). In CEAs that use cluster randomized trials (CRTs), the imputation model, like the analysis model, should recognize the hierarchical structure of the data. This paper contrasts a multilevel MI approach that recognizes clustering, with single-level MI and complete case analysis (CCA) in CEAs that use CRTs. We consider a multilevel MI approach compatible with multilevel analytical models for CEAs that use CRTs. We took fully observed data from a CEA that evaluated an intervention to improve diagnosis of active labor in primiparous women using a CRT (2078 patients, 14 clusters). We generated scenarios with missing costs and outcomes that differed, for example, according to the proportion with missing data (10%-50%), the covariates that predicted missing data (individual, cluster-level), and the missingness mechanism: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). We estimated incremental net benefits (INBs) for each approach and compared them with the estimates from the fully observed data, the "true" INBs. When costs and outcomes were assumed to be MCAR, the INBs for each approach were similar to the true estimates. When data were MAR, the point estimates from the CCA differed from the true estimates. Multilevel MI provided point estimates and standard errors closer to the true values than did single-level MI across all settings, including those in which a high proportion of observations had cost and outcome data MAR and when data were MNAR. Multilevel MI accommodates the multilevel structure of the data in CEAs that use cluster trials and provides accurate cost-effectiveness estimates across the range of circumstances considered.
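
For readers unfamiliar with the incremental net benefit used as the estimand above, its standard definition is sketched below. The notation is an assumption of this note rather than the paper's own: \lambda is the willingness-to-pay threshold per unit of outcome, \Delta E is the difference in mean outcomes between intervention and control, and \Delta C is the corresponding difference in mean costs.

```latex
\[
  \mathrm{INB}(\lambda) \;=\; \lambda \,\Delta E \;-\; \Delta C ,
  \qquad
  \Delta E = \bar{E}_{\text{intervention}} - \bar{E}_{\text{control}},
  \quad
  \Delta C = \bar{C}_{\text{intervention}} - \bar{C}_{\text{control}} .
\]
```

Because the INB combines costs and outcomes in one number, missingness in either component affects it, which is why the scenarios above impose missing data on both.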

  • Research Article
  • Citations: 3
  • 10.1136/bmjqs-2023-016387
Handling missing values in the analysis of between-hospital differences in ordinal and dichotomous outcomes: a simulation study
  • Sep 21, 2023
  • BMJ Quality & Safety
  • Reinier C A Van Linschoten + 12 more

Missing data are frequently encountered in registries that are used to compare performance across hospitals. The most appropriate method for handling missing data when analysing differences in outcomes between hospitals...

  • Research Article
  • Citations: 25
  • 10.1186/1742-5573-8-5
The use of complete-case and multiple imputation-based analyses in molecular epidemiology studies that assess interaction effects
  • Oct 6, 2011
  • Epidemiologic Perspectives & Innovations : EP+I
  • Manisha Desai + 3 more

Background: In molecular epidemiology studies biospecimen data are collected, often with the purpose of evaluating the synergistic role between a biomarker and another feature on an outcome. Typically, biomarker data are collected on only a proportion of subjects eligible for study, leading to a missing data problem. Missing data methods, however, are not customarily incorporated into analyses. Instead, complete-case (CC) analyses are performed, which can result in biased and inefficient estimates. Methods: Through simulations, we characterized the performance of CC methods when interaction effects are estimated. We also investigated whether standard multiple imputation (MI) could improve estimation over CC methods when the data are not missing at random (NMAR) and auxiliary information may or may not exist. Results: CC analyses were shown to result in considerable bias and efficiency loss. While MI reduced bias and increased efficiency over CC methods under specific conditions, it too resulted in biased estimates depending on the strength of the auxiliary data available and the nature of the missingness. In particular, CC performed better than MI when extreme values of the covariate were more likely to be missing, while MI outperformed CC when missingness of the covariate related to both the covariate and outcome. MI always improved performance when strong auxiliary data were available. In a real study, MI estimates of interaction effects were attenuated relative to those from a CC approach. Conclusions: Our findings suggest the importance of incorporating missing data methods into the analysis. If the data are MAR, standard MI is a reasonable method. Auxiliary variables may make this assumption more reasonable even if the data are NMAR. Under NMAR we emphasize caution when using standard MI and recommend it over CC only when strong auxiliary data are available. MI, with the missing data mechanism specified, is an alternative when the data are NMAR. In all cases, it is recommended to take advantage of MI's ability to account for the uncertainty of these assumptions.

  • Research Article
  • 10.1186/s12874-025-02594-2
Comparison of methods to handle missing values in a continuous index test in a diagnostic accuracy study – a simulation study
  • May 27, 2025
  • BMC Medical Research Methodology
  • Katharina Stahlmann + 4 more

Background: Most diagnostic accuracy studies have applied a complete case analysis (CCA) or single imputation approach to address missing values in the index test, which may lead to biased results. Therefore, this simulation study aims to compare the performance of different methods in estimating the AUC of a continuous index test with missing values in a single-test diagnostic accuracy study. Methods: We simulated data for a reference standard, continuous index test, and three covariates using different sample sizes, prevalences of the target condition, correlations between index test and covariates, and true AUCs. Subsequently, missing values were induced for the continuous index test, assuming varying proportions of missing values and missingness mechanisms. Seven methods (multiple imputation (MI), empirical likelihood, and inverse probability weighting approaches) were compared to a CCA in terms of their performance to estimate the AUC given missing values in the index test. Results: Under missing completely at random (MCAR) and many missing values, CCA gives good results for a small sample size and all methods perform well for a large sample size. If missing values are missing at random (MAR), all methods are severely biased if the sample size and prevalence are small. An augmented inverse probability weighting method and standard MI methods perform well with higher prevalence and larger sample size, respectively. Most methods give biased results if missing values are missing not at random (MNAR) and the correlation or the sample size and prevalence are low. Methods using the covariates improve with increasing correlation. Conclusions: Most methods perform well if the proportion of missing values is small. Given a higher proportion of missing values and MCAR, we recommend conducting a CCA for a small sample size and standard MI methods for a large sample size. In the absence of better alternatives, we recommend conducting a CCA and discussing its limitations if the sample size is small and missing values are M(N)AR. Standard MI methods and the augmented inverse probability approach may be a good alternative if the sample size and/or correlation increases. All methods are biased under MNAR and a low correlation.

  • Abstract
  • 10.1016/s0924-9338(11)72279-9
P01-568 - Using the CES-D scale in a large cohort study and dealing with missing data: Application to the French E3N cohort
  • Mar 1, 2011
  • European Psychiatry
  • N Resseguier + 3 more


  • Conference Article
  • 10.1136/jech-2015-206256.18
OP18 Using linked administrative data to reduce bias due to missing outcome data in exposure-outcome estimates: a study of the association between breastfeeding and iq using simulations and data from a birth cohort
  • Aug 31, 2015
  • Journal of Epidemiology and Community Health
  • Rp Cornish + 4 more

Background: Most epidemiological studies have missing information, leading to reduced power and potential bias. Exposure-outcome associations will generally be biased if the outcome variable is missing not at random (MNAR). Linkage to administrative data containing a proxy for the outcome allows assessment of MNAR. We used data from the Avon Longitudinal Study of Parents and Children (ALSPAC) and simulations to examine bias in the association between infant breastfeeding and IQ at 15 years, using linked school attainment data as a proxy for IQ. Methods: ALSPAC: Subjects were those who enrolled in 1990–91 and were alive at one year (n = 13,795), of whom 36% had IQ measured at 15. For those with missing IQ, 79% had data on attainment at age 16 obtained through linkage to the National Pupil Database. Breastfeeding information was collected via questionnaire at 1, 6 and 15 months. A number of potential confounders/factors predictive of non-response were collected during pregnancy. We estimated the association between duration of breastfeeding and IQ using a complete case analysis, multiple imputation (MI), and MI including linked attainment data. Simulations: In the simulations we changed the strength of association between the outcome and the linked proxy, the proportion of missing data, and the extent to which the outcome was MNAR. Results: IQ measured at 15 in ALSPAC was MNAR: individuals with higher attainment were less likely to have missing IQ, even after adjusting for socio-demographic factors. The correlation between IQ and the main attainment variable was 0.59. Both complete case analysis and MI underestimated the association between breastfeeding and IQ compared to MI informed by linkage (mean difference in IQ comparing those breastfed for at least 6 months to those breastfed for less than one month was 4.2 (95% CI 3.4, 5.0) using MI informed by linkage but 3.5 (2.5, 4.4) in the complete case analysis). In simulations, including the linked proxy reduced bias and increased precision in all cases, although improvements were small when the correlation between the outcome and its proxy was low (.5). Conclusion: Linkage to administrative data containing a proxy for the outcome variable allows the MNAR assumption to be tested and more efficient analyses to be performed. Key limiting factors are the strength of association between the outcome and its proxy and coverage of the linked data; in our case, where the correlation was modest and linked data were not available for all individuals, some bias may remain.

  • Research Article
  • Citations: 9
  • 10.1002/bimj.201900117
Approaches for missing covariate data in logistic regression with MNAR sensitivity analyses.
  • Jan 20, 2020
  • Biometrical Journal
  • Ralph C Ward + 2 more

Data with missing covariate values but fully observed binary outcomes are an important subset of the missing data challenge. Common approaches are complete case analysis (CCA) and multiple imputation (MI). While CCA relies on missing completely at random (MCAR), MI usually relies on a missing at random (MAR) assumption to produce unbiased results. For MI involving logistic regression models, it is also important to consider several missing not at random (MNAR) conditions under which CCA is asymptotically unbiased and, as we show, MI is also valid in some cases. We use a data application and simulation study to compare the performance of several machine learning and parametric MI methods under a fully conditional specification framework (MI-FCS). Our simulation includes five scenarios involving MCAR, MAR, and MNAR under predictable and nonpredictable conditions, where "predictable" indicates missingness is not associated with the outcome. We build on previous results in the literature to show MI and CCA can both produce unbiased results under more conditions than some analysts may realize. When both approaches were valid, we found that MI-FCS was at least as good as CCA in terms of estimated bias and coverage, and was superior when missingness involved a categorical covariate. We also demonstrate how MNAR sensitivity analysis can build confidence that unbiased results were obtained, including under MNAR-predictable, when CCA and MI are both valid. Since the missingness mechanism cannot be identified from observed data, investigators should compare results from MI and CCA when both are plausibly valid, followed by MNAR sensitivity analysis.

  • Abstract
  • 10.1136/jech.2011.142976m.65
P2-538 Using the CES-D scale in a large cohort and dealing with missing data: application to the French E3N cohort
  • Aug 1, 2011
  • Journal of Epidemiology and Community Health
  • N Resseguier + 3 more

Introduction: The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing data (MD) in one or several of the 20 items of the scale are...

  • Research Article
  • Citations: 15
  • 10.1186/s12874-016-0188-1
How to deal with missing longitudinal data in cost of illness analysis in Alzheimer's disease-suggestions from the GERAS observational study.
  • Jul 18, 2016
  • BMC Medical Research Methodology
  • Mark Belger + 10 more

Background: Missing data are a common problem in prospective studies with a long follow-up, and the volume, pattern and reasons for missing data may be relevant when estimating the cost of illness. We aimed to evaluate the effects of different methods for dealing with missing longitudinal cost data and for costing caregiver time on total societal costs in Alzheimer’s disease (AD). Methods: GERAS is an 18-month observational study of costs associated with AD. Total societal costs included patient health and social care costs, and caregiver health and informal care costs. Missing data were classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Simulation datasets were generated from baseline data with 10–40 % missing total cost data for each missing data mechanism. Datasets were also simulated to reflect the missing cost data pattern at 18 months using MAR and MNAR assumptions. Naïve and multiple imputation (MI) methods were applied to each dataset and results compared with complete GERAS 18-month cost data. Opportunity and replacement cost approaches were used for caregiver time, which was costed with and without supervision included and with time for working caregivers only being costed. Results: Total costs were available for 99.4 % of 1497 patients at baseline. For MCAR datasets, naïve methods performed as well as MI methods. For MAR, MI methods performed better than naïve methods. All imputation approaches were poor for MNAR data. For all approaches, percentage bias increased with missing data volume. For datasets reflecting 18-month patterns, a combination of imputation methods provided more accurate cost estimates (e.g. bias: −1 % vs −6 % for single MI method), although different approaches to costing caregiver time had a greater impact on estimated costs (29–43 % increase over base case estimate). Conclusions: Methods used to impute missing cost data in AD will impact on accuracy of cost estimates although varying approaches to costing informal caregiver time has the greatest impact on total costs. Tailoring imputation methods to the reason for missing data will further our understanding of the best analytical approach for studies involving cost outcomes.

  • Research Article
  • Citations: 55
  • 10.1002/sim.6902
A multiple imputation approach for MNAR mechanisms compatible with Heckman's model.
  • Feb 18, 2016
  • Statistics in Medicine
  • Jacques‐Emmanuel Galimard + 3 more

Standard implementations of multiple imputation (MI) approaches provide unbiased inferences based on an assumption of underlying missing at random (MAR) mechanisms. However, in the presence of missing data generated by missing not at random (MNAR) mechanisms, MI is not satisfactory. Originating in an econometric statistical context, Heckman's model, also called the sample selection method, deals with selected samples using two joined linear equations, termed the selection equation and the outcome equation. It has been successfully applied to MNAR outcomes. Nevertheless, such a method only addresses missing outcomes, and this is a strong limitation in clinical epidemiology settings, where covariates are also often missing. We propose to extend the validity of MI to some MNAR mechanisms through the use of Heckman's model as the imputation model and a two-step estimation process. This approach will provide a solution that can be used in an MI by chained equation framework to impute missing data (either outcomes or covariates) resulting from either a MAR or an MNAR mechanism when the MNAR mechanism is compatible with Heckman's model. The approach is illustrated on a real dataset from a randomised trial in patients with seasonal influenza.
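
The two joined equations mentioned above can be written out in their textbook form as follows. The notation is an assumption of this note, not taken from the paper: Y_i is the partially observed variable, R_i indicates whether it is observed, X_i and Z_i are covariate vectors, and \rho is the correlation between the two error terms.

```latex
\[
\begin{aligned}
\text{outcome equation:}\quad   & Y_i = X_i^{\top}\beta + \varepsilon_i, \\
\text{selection equation:}\quad & R_i^{*} = Z_i^{\top}\gamma + u_i,
                                  \qquad R_i = \mathbf{1}\{R_i^{*} > 0\}, \\
& \begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix}
  \sim \mathcal{N}\!\left(
    \begin{pmatrix} 0 \\ 0 \end{pmatrix},
    \begin{pmatrix} \sigma^{2} & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}
  \right),
\end{aligned}
\]
```

Here Y_i is observed only when R_i = 1. When \rho = 0 the chance of observing Y_i depends only on Z_i, which corresponds to a MAR mechanism given Z_i; when \rho \neq 0 it also depends on the unobserved part of Y_i, which is the MNAR case the model is designed to handle.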
