Abstract
It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets before model fitting can produce poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multistudy ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multistudy ensembling uses a two-stage stacking strategy which fits study-specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model-fitting stage, potentially resulting in performance losses. Motivated by challenges in the estimation of COVID-attributable mortality, we propose optimal ensemble construction, an approach to multistudy stacking whereby we jointly estimate ensemble weights and parameters associated with study-specific models. We prove that limiting cases of our approach yield existing methods such as multistudy stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the loss function. We use our method to perform multicountry COVID-19 baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. We further compare and characterize the method's performance in data-driven simulations and other numerical experiments. Our method remains competitive with or outperforms multistudy stacking and other earlier methods in the COVID-19 data application and in a range of simulation settings.
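For readers unfamiliar with the two-stage baseline described above, the following is a minimal sketch of multistudy stacking on simulated data. It is not the authors' implementation and does not show the proposed joint estimation, whose loss function is not stated in this abstract. The simulated studies, the linear study-specific models, and all names (`simulate_study`, `ensemble_predict`, `beta_mean`) are illustrative assumptions. Stacking weights are often constrained to be nonnegative; this sketch omits that constraint for brevity and uses unconstrained least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study(n, beta, noise=0.5):
    """Draw one hypothetical study from a linear model with its own coefficients."""
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + noise * rng.normal(size=n)
    return X, y

# Assumption: four heterogeneous studies whose coefficients vary around a shared mean.
beta_mean = np.array([1.0, -2.0, 0.5])
studies = [simulate_study(50, beta_mean + 0.3 * rng.normal(size=3)) for _ in range(4)]

# Stage 1: fit one least-squares model per study, ignoring the ensemble.
betas = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in studies]

# Stage 2: predict all studies' data with every study-specific model,
# then regress the outcomes on those predictions to obtain ensemble weights.
X_all = np.vstack([X for X, _ in studies])
y_all = np.concatenate([y for _, y in studies])
P = np.column_stack([X_all @ b for b in betas])  # shape (n_total, n_studies)
weights = np.linalg.lstsq(P, y_all, rcond=None)[0]

def ensemble_predict(X_new):
    """Weighted combination of the study-specific predictions."""
    return np.column_stack([X_new @ b for b in betas]) @ weights

# In-sample comparison on the stacked data: ensemble versus each single-study model.
stacked_mse = np.mean((y_all - ensemble_predict(X_all)) ** 2)
single_mse = [np.mean((y_all - P[:, k]) ** 2) for k in range(len(betas))]
print("weights:", np.round(weights, 3))
print("stacked MSE:", stacked_mse, "best single-study MSE:", min(single_mse))
```

Stage 1 never sees the weights and Stage 2 never revisits the coefficients, which is the separation the abstract identifies as a potential source of performance loss.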