Abstract
Bayesian regression determines model parameters by minimizing the expected loss, an upper bound to the true generalization error. However, this loss ignores model form error, or misspecification, so parameter uncertainties are significantly underestimated and vanish in the large data limit. As misspecification is the main source of uncertainty for surrogate models of low-noise calculations, such as those arising in atomistic simulation, predictive uncertainties are systematically underestimated.
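The vanishing of parameter uncertainty is the textbook behavior of Bayesian linear regression and is standard background rather than a result of this paper: for a linear model $f(\mathbf{x};\boldsymbol\theta)=\boldsymbol\phi(\mathbf{x})^\top\boldsymbol\theta$ with Gaussian likelihood of variance $\sigma^2$ and prior covariance $\boldsymbol\Sigma_0$, the posterior covariance is
\begin{equation*}
\boldsymbol\Sigma_{\rm post} = \left(\boldsymbol\Sigma_0^{-1} + \frac{1}{\sigma^2}\sum_{i=1}^{N}\boldsymbol\phi_i\boldsymbol\phi_i^\top\right)^{-1} = \mathcal{O}\!\left(\sigma^2/N\right),
\end{equation*}
which collapses as $N\to\infty$ or $\sigma\to 0$, regardless of whether the model can actually represent the data.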
We analyze the true generalization error of misspecified, near-deterministic surrogate models, a regime of broad relevance in science and engineering. We show that posterior parameter distributions must cover every training point to avoid a divergence in the generalization error and design a compatible \textit{ansatz} which incurs minimal overhead for linear models. The approach is demonstrated on model problems before application to thousand-dimensional datasets in atomistic machine learning. Our efficient misspecification-aware scheme gives accurate prediction and bounding of test errors in terms of parameter uncertainties, allowing this important source of uncertainty to be incorporated in multi-scale computational workflows.
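The covering condition can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's construction: the toy quadratic data, the misspecified linear features, and the isotropic parameter covariance $\alpha\mathbf{I}$, whose scale is chosen only so that the pointwise parameter uncertainty $\boldsymbol\phi_i^\top\boldsymbol\Sigma\boldsymbol\phi_i$ covers every training residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Near-deterministic data from a quadratic truth, fit with a misspecified
# linear feature model f(x; theta) = Phi(x) @ theta (toy setup, not the
# paper's experiments).
N = 200
x = np.sort(rng.uniform(-1.0, 1.0, N))
y = x**2 + 1e-4 * rng.standard_normal(N)  # tiny noise: misspecification dominates

Phi = np.stack([np.ones_like(x), x], axis=1)  # misspecified basis (no x^2 term)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
resid = y - Phi @ theta_hat

# Standard Bayesian posterior covariance ~ sigma^2 (Phi^T Phi)^{-1} shrinks
# as 1/N, so predictive uncertainty vanishes while the model error does not.
sigma2 = resid.var()
cov_bayes = sigma2 * np.linalg.inv(Phi.T @ Phi)
std_bayes = np.sqrt(np.einsum("ij,jk,ik->i", Phi, cov_bayes, Phi))

# Hypothetical isotropic covering ansatz (not the paper's): scale
# Sigma = alpha * I so that Phi_i Sigma Phi_i^T >= resid_i^2 at every
# training point, i.e. the posterior covers all training residuals.
alpha = np.max(resid**2 / np.sum(Phi**2, axis=1))
cov_aware = alpha * np.eye(Phi.shape[1])
std_aware = np.sqrt(np.einsum("ij,jk,ik->i", Phi, cov_aware, Phi))

print(f"max |train residual|    : {np.abs(resid).max():.4f}")
print(f"max standard Bayes std  : {std_bayes.max():.4f}  (underestimates)")
print(f"max covering-ansatz std : {std_aware.max():.4f}  (covers residuals)")
assert np.all(std_aware >= np.abs(resid) - 1e-12)
```

By construction $\alpha\,\|\boldsymbol\phi_i\|^2 \geq r_i^2$ for all $i$, so the covering condition holds pointwise; the standard posterior std instead scales as $\mathcal{O}(N^{-1/2})$ and falls far below the largest residual.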