Abstract

Using machine learning to estimate the heterogeneity of treatment effects (HTE) in randomized experiments is common when firms and digital platforms seek to understand how individuals differ in their responses to a policy. However, will the average effect implied by the HTE model align with simple subgroup average effect estimates from the same randomized experiment? Using a large-scale randomized experiment on Facebook, we observe a substantial discrepancy between machine learning-based treatment effect estimates and the difference-in-means estimator from the same experiment. We propose using the quantile-quantile plot to diagnose whether the model-based estimates are biased. To correct the bias, we provide a model-agnostic method, in the vein of the \citet{platt1999probabilistic} technique used in supervised learning, that calibrates black-box estimates of HTEs to known unbiased average effect estimates, ensuring that their sign and magnitude approximate experimental benchmarks. Our method requires no additional data beyond what is needed to estimate HTEs, and it scales to arbitrarily large datasets. It also enables the use of stacking to ensemble estimates from multiple HTE models based on their out-of-sample estimates, improving performance. We demonstrate the effectiveness of our method in a randomized experiment run on Facebook, and we use extensive synthetic simulations to illustrate two mechanisms that generate the bias and to confirm the effectiveness of the correction. The issue we document and the proposed diagnostic and correction approach have strong implications and broad applications for the IS community and online platforms.
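To make the calibration idea concrete, the sketch below shows one Platt-style linear recalibration consistent with the abstract's description; it is an illustration, not the paper's exact procedure. The function name calibrate_hte, the Bernoulli(p) treatment assignment, the linear map a + b * tau_hat, and the use of a Horvitz-Thompson pseudo-outcome (whose sample mean is an unbiased estimate of the average treatment effect under randomization) are all assumptions introduced here. Because OLS fitted values share the mean of their target, the calibrated HTE estimates automatically average to that unbiased benchmark.

    import numpy as np

    def calibrate_hte(tau_hat, y, w, p=0.5):
        """Hypothetical Platt-style recalibration of out-of-sample HTE estimates.

        tau_hat : black-box CATE estimates (out-of-fold), shape (n,)
        y       : observed outcomes, shape (n,)
        w       : binary treatment indicator, shape (n,)
        p       : known randomization probability
        """
        # Horvitz-Thompson pseudo-outcome: unbiased for the individual-level
        # treatment effect under Bernoulli(p) randomization.
        psi = (w / p - (1 - w) / (1 - p)) * y
        # OLS of psi on tau_hat yields intercept a and slope b; the fitted
        # values a + b * tau_hat then average to psi.mean(), the unbiased
        # average-effect estimate from the experiment.
        X = np.column_stack([np.ones_like(tau_hat), tau_hat])
        a, b = np.linalg.lstsq(X, psi, rcond=None)[0]
        return a + b * tau_hat

    # Toy usage: a deliberately biased HTE model whose average is corrected.
    rng = np.random.default_rng(0)
    n = 10_000
    x = rng.normal(size=n)
    w = rng.binomial(1, 0.5, size=n)
    tau_true = 1.0 + 0.5 * x
    y = x + w * tau_true + rng.normal(size=n)
    tau_hat = 0.2 + 2.0 * tau_true        # stand-in for a biased ML model
    tau_cal = calibrate_hte(tau_hat, y, w)
    print(tau_hat.mean(), tau_cal.mean()) # biased mean ~2.2 vs. calibrated ~1.0

In this toy example the raw model estimates average to roughly 2.2 while the true average effect is 1.0; after recalibration the estimates average to the unbiased experimental benchmark while preserving the model's ranking of individuals.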
