Abstract

Background: Flexible, data-adaptive algorithms (machine learning; ML) for nuisance parameter estimation in epidemiologic causal inference have promising asymptotic properties for complex, high-dimensional data. However, recently proposed applications (e.g. targeted maximum likelihood estimation; TMLE) may produce biased parameter and standard error estimates in common real-world cohort settings. The relative performance of these novel estimators over simpler approaches in such settings is unclear.

Methods: We apply double-crossfit TMLE, augmented inverse probability weighting (AIPW), and standard IPW to simple simulations (5 covariates) and to “real-world” data, using covariate-structure-preserving (“plasmode”) simulations of 1,178 subjects and 331 covariates from a longitudinal birth cohort. We evaluate various data generating and estimation scenarios, including under- and over-identification (the latter e.g. through excess orthogonal covariates), poor data support, near-instruments, and mis-specified biological interactions. We also track representative computation times.

Results: We replicate the optimal performance of cross-fit, doubly robust estimators in simple data generating processes. However, in nearly every real-world-based scenario, estimators fit with parametric learners outperform those that include non-parametric learners in terms of mean bias and confidence interval coverage. Even when correctly specified, estimators fit with non-parametric algorithms (xgboost, random forest) performed poorly (e.g. 24% bias and 57% coverage vs. 10% bias and 79% coverage for the parametric fit), at times underperforming simple IPW.

Conclusions: In typical epidemiologic data sets, double-crossfit estimators fit with simple, smooth, parametric learners may be the optimal solution, taking 2-5 times less computation time than flexible non-parametric models while having equal or better performance. No single approach is optimal in all settings, and estimators should be compared on simulations close to the source data.

Key messages: In epidemiologic studies, the use of flexible non-parametric algorithms for effect estimation should be strongly justified (i.e. by high-dimensional covariates) and performed with care. Parametric learners may be a safer option with few drawbacks.
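
The estimators compared above can be illustrated with a toy sketch. This is not the authors' code. It assumes a single binary confounder, a simulated average treatment effect of 1, and saturated stratum means standing in for parametric learners; all function names are hypothetical. It follows one common reading of double cross-fitting, in which the propensity model, the outcome model, and the effect estimate each use a different split of the data, and applies it to IPW and AIPW. TMLE is omitted for brevity.

```python
import random
from statistics import mean


def simulate(n, seed=0):
    """Toy cohort (assumed): binary confounder x, binary exposure a, outcome y; true ATE = 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < 0.3 + 0.4 * x else 0   # exposure depends on x
        y = 1.0 * a + 2.0 * x + rng.gauss(0.0, 1.0)    # outcome depends on a and x
        rows.append((x, a, y))
    return rows


def fit_propensity(rows):
    """P(A=1 | X=x), estimated as the exposed proportion within each stratum of x."""
    return {x: mean(a for xx, a, _ in rows if xx == x) for x in (0, 1)}


def fit_outcome(rows):
    """E[Y | A=a, X=x], estimated as the mean outcome within each (a, x) cell."""
    return {(a, x): mean(y for xx, aa, y in rows if xx == x and aa == a)
            for a in (0, 1) for x in (0, 1)}


def ipw(rows, ps):
    """Inverse probability weighted estimate of the ATE (unnormalized weights)."""
    return mean(a * y / ps[x] - (1 - a) * y / (1 - ps[x]) for x, a, y in rows)


def aipw(rows, ps, mu):
    """Augmented IPW: outcome-model contrast plus weighted residual corrections."""
    return mean(
        mu[(1, x)] - mu[(0, x)]
        + a * (y - mu[(1, x)]) / ps[x]
        - (1 - a) * (y - mu[(0, x)]) / (1 - ps[x])
        for x, a, y in rows
    )


def double_crossfit(rows, seed=0):
    """Rotate three splits: one fits the propensity model, one the outcome model, one estimates."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::3] for i in range(3)]
    ipw_parts, aipw_parts = [], []
    for k in range(3):
        ps = fit_propensity(folds[k])
        mu = fit_outcome(folds[(k + 1) % 3])
        target = folds[(k + 2) % 3]
        ipw_parts.append(ipw(target, ps))
        aipw_parts.append(aipw(target, ps, mu))
    # Average the split-specific estimates
    return mean(ipw_parts), mean(aipw_parts)


def naive(rows):
    """Unadjusted difference in mean outcome between exposed and unexposed."""
    return (mean(y for _, a, y in rows if a == 1)
            - mean(y for _, a, y in rows if a == 0))


data = simulate(6000, seed=1)
naive_est = naive(data)
ipw_est, aipw_est = double_crossfit(data, seed=1)
print(f"naive={naive_est:.2f}  IPW={ipw_est:.2f}  AIPW={aipw_est:.2f}")
```

In this toy setting the unadjusted contrast is confounded by x, while both weighted estimators should land near the simulated effect of 1. Performance on real cohort data may differ considerably, which is the point the abstract makes.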

