Abstract

The linear coefficient in a partially linear model with confounding variables can be estimated using double machine learning (DML). However, this DML estimator has a two-stage least squares (TSLS) interpretation and may produce overly wide confidence intervals. To address this issue, we propose a regularization and selection scheme, regsDML, which leads to narrower confidence intervals. It selects either the TSLS DML estimator or a regularization-only estimator, depending on whose estimated variance is smaller; the regularization-only estimator is tailored to have a low mean squared error. The regsDML estimator is fully data driven: it converges at the parametric rate, is asymptotically Gaussian distributed, and is asymptotically equivalent to the TSLS DML estimator, but it exhibits substantially better finite-sample properties. The regsDML estimator builds on the idea of k-class estimators, and we show how DML and k-class estimation can be combined to estimate the linear coefficient in a partially linear endogenous model. Empirical examples demonstrate our methodological and theoretical developments. Software code for our regsDML method is available in the R-package dmlalg.
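In words, the selection step of regsDML compares the estimated variances of the two candidate estimators and reports whichever is smaller, together with its confidence interval. The following minimal Python sketch (with a hypothetical helper name and made-up numbers; the actual estimators come from the DML machinery of the paper) illustrates only this selection step:

```python
import numpy as np
from scipy.stats import norm

def select_regsdml(beta_dml, var_dml, beta_reg, var_reg, level=0.95):
    """Illustrative selection step of regsDML: keep the candidate
    (TSLS-type DML or regularization-only) with the smaller
    estimated variance and build a Gaussian confidence interval.
    Inputs are assumed to be precomputed by the DML machinery."""
    if var_dml <= var_reg:
        beta, var = beta_dml, var_dml
    else:
        beta, var = beta_reg, var_reg
    half_width = norm.ppf(0.5 + level / 2) * np.sqrt(var)
    return beta, (beta - half_width, beta + half_width)

# Made-up numbers: the regularization-only candidate has the
# smaller estimated variance, so it is selected.
beta_hat, ci = select_regsdml(beta_dml=1.02, var_dml=0.09,
                              beta_reg=0.98, var_reg=0.04)
print(beta_hat, ci)
```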

Highlights

  • Partially linear models (PLMs) combine the flexibility of nonparametric approaches with the ease of interpretation of linear models

  • We extend and regularize double machine learning (DML) in potentially overidentified partially linear models (PLMs) with hidden variables

  • A clinical study may experience an endogeneity issue if a treatment is not randomly assigned and subjects receiving different treatments differ in ways other than the treatment [73]

Summary

Introduction

Partially linear models (PLMs) combine the flexibility of nonparametric approaches with the ease of interpretation of linear models. If a treatment is not randomly assigned in a clinical study, subjects receiving different treatments differ in ways other than only the treatment [73]. Another situation where an explanatory variable is correlated with the error term occurs when the explanatory variable is determined simultaneously with the response [97]. We insert potentially biased machine learning (ML) estimates of the nuisance parameters E[A|W], E[X|W], and E[Y|W] into the identifiability equation for β0. Three well-established k-class estimators are the limited information maximum likelihood (LIML) estimator [10, 4] and the Fuller(1) and Fuller(4) estimators [43]. They were developed for entirely linear models to overcome some deficiencies of TSLS. Empirical simulations demonstrate that regsDML typically leads to shorter confidence intervals than LIML, Fuller(1), and Fuller(4), while it still attains the nominal coverage level.
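For reference, in a fully linear instrumental-variables model with outcome y, regressor matrix X, and instrument matrix Z, the k-class family referred to above takes the form β(k) = (X'(I − k M_Z) X)^{-1} X'(I − k M_Z) y with M_Z = I − Z(Z'Z)^{-1}Z'; k = 0 gives OLS, k = 1 gives TSLS, and LIML and Fuller use data-dependent values of k. A minimal Python sketch on hypothetical simulated data (not the paper's code):

```python
import numpy as np

def k_class(y, X, Z, k):
    """k-class estimator for a linear IV model:
    beta(k) = (X' (I - k*M_Z) X)^{-1} X' (I - k*M_Z) y,
    where M_Z = I - Z (Z'Z)^{-1} Z' is the residual-maker of Z.
    k = 0 gives OLS and k = 1 gives TSLS; LIML and Fuller
    correspond to data-dependent choices of k."""
    n = len(y)
    M_Z = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    A = np.eye(n) - k * M_Z
    return np.linalg.solve(X.T @ A @ X, X.T @ A @ y)

# Hypothetical data-generating process: one endogenous regressor,
# two instruments, confounding through a shared hidden variable h.
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 2))
h = rng.normal(size=n)                    # hidden confounder
x = Z @ np.array([1.0, 0.5]) + h + rng.normal(size=n)
y = 2.0 * x + h + rng.normal(size=n)      # true coefficient is 2
X = x.reshape(-1, 1)
print(k_class(y, X, Z, k=0.0))  # OLS: biased by the confounding
print(k_class(y, X, Z, k=1.0))  # TSLS: approximately 2
```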

Our contribution
Additional literature
An identifiability condition and the DML estimator
Identifiability condition
Alternative interpretations of β0
Formulation of the DML estimator and its asymptotic properties
Regularizing the DML estimator: regDML and regsDML
Estimation and asymptotic normality
Estimating the regularization parameter γ
Numerical experiments
Simulation example with random forests
Real data example
Findings
Conclusion