Abstract
Background
Causal inference from cohort studies is central to epidemiological research. Targeted Maximum Likelihood Estimation (TMLE) is an appealing doubly robust method for causal effect estimation, but it is unclear how missing data should be handled when TMLE is used in conjunction with machine learning approaches for the exposure and outcome models. This is problematic because missing data are ubiquitous and can result in biased estimates and loss of precision if handled inappropriately.

Methods
Based on a motivating example from the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate the performance of available approaches for handling missing data when using TMLE with machine learning. These included complete-case analysis; an extended TMLE approach incorporating an outcome missingness probability model; the missing indicator approach for missing covariate data (MCMI); and multiple imputation (MI) using standard parametric approaches or machine learning algorithms. We considered 11 missingness mechanisms typical in cohort studies, and a simple and a complex setting, in which exposure and outcome generation models included two-way and higher-order interactions.

Results
MI using regression with no interactions and MI with random forest yielded estimates with the highest bias. MI with regression including two-way interactions was the best performing method overall. Of the non-MI approaches, MCMI performed the worst.

Conclusions
When using TMLE with machine learning to estimate the average causal effect, avoiding standard MI with no interactions and MCMI is recommended.

Key messages
We provide novel guidance for handling missing data for causal effect estimation using TMLE.
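For readers unfamiliar with the missing indicator approach for covariates (MCMI) named in the Methods, the following is a minimal sketch of the general idea, not the study's implementation. It assumes missing covariate values are coded as NaN and filled with a constant (0 here, an arbitrary choice), and that one 0/1 indicator column is appended per incompletely observed covariate. The function name `add_missing_indicators` and the toy data are hypothetical.

```python
import numpy as np


def add_missing_indicators(X, fill_value=0.0):
    """Missing-indicator encoding of a covariate matrix (illustrative only).

    X is a 2-D array in which np.nan marks a missing covariate value.
    Returns (X_aug, cols): the covariates with missing entries replaced by
    fill_value, followed by one 0/1 indicator column for each covariate
    that has at least one missing value; cols holds those column indices.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                        # True where a value is missing
    cols = np.flatnonzero(missing.any(axis=0))   # covariates with any missingness
    X_filled = np.where(missing, fill_value, X)  # constant fill (assumed value)
    indicators = missing[:, cols].astype(float)  # 1.0 = value was missing
    return np.hstack([X_filled, indicators]), cols


# Toy data: two covariates, each with one missing value.
X = np.array([
    [1.0, np.nan],
    [2.0, 5.0],
    [np.nan, 7.0],
    [4.0, 8.0],
])
X_aug, cols = add_missing_indicators(X)
```

The augmented matrix would then be supplied as the covariate set to the exposure and outcome models. As the Results above indicate, this approach performed the worst of the non-MI approaches in the simulations, so the sketch is explanatory rather than a recommendation.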