Abstract
Reliable uncertainty quantification for statistical models is crucial in many downstream applications, especially drug design and discovery, where mistakes can be very costly. The topic has therefore attracted much attention, and a plethora of methods have been proposed in recent years. The approaches reported so far fall mainly into two classes: distance-based approaches and Bayesian approaches. Although these methods have been widely used and have shown promising performance, each with distinct strengths, overconfidence on out-of-distribution examples still poses a challenge for deploying them in real-world applications. In this study, we investigated a number of consensus strategies that combine distance-based and Bayesian approaches with post-hoc calibration to improve uncertainty quantification in QSAR (Quantitative Structure–Activity Relationship) regression modeling. We employed a set of criteria to quantitatively assess the ranking and calibration ability of these models. Experiments on 24 bioactivity datasets were designed to critically compare the proposed model with other well-studied baseline models. Our findings indicate that the proposed hybrid framework robustly improves the model's ability to rank absolute errors. Together with post-hoc calibration on the validation set, we show that well-calibrated uncertainty estimates can be obtained in domain shift settings. The complementarity between the different methods is also analyzed conceptually.
Highlights
With the increasing scale of available datasets, deep learning methods have made a tremendous impact in the chemical domain [1].
We investigated the performance of several consensus strategies that combine distance-based and Bayesian uncertainty quantification approaches in the context of deep learning-based Quantitative Structure–Activity Relationship (QSAR) regression modeling.
For an out-of-domain molecule containing a biphosphate group, var(μ(θ, x)) will be quite small since the model takes it as an in-domain sample that can be well explained by the posterior weights
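The quantity var(μ(θ, x)) in the last highlight can be illustrated with a minimal sketch. It assumes that the posterior over weights θ is approximated by a small set of sampled models (for example an ensemble), so that the variance is taken over the predicted means of those members. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def ensemble_mean_and_variance(member_predictions):
    """Summarize predictions from several sampled models.

    member_predictions: array of shape (n_members, n_molecules), where each
    row holds the predicted means mu(theta_i, x) of one sampled model.
    Returns the per-molecule mean and the variance across members, which
    serves here as a stand-in for var(mu(theta, x)).
    """
    preds = np.asarray(member_predictions, dtype=float)
    mean = preds.mean(axis=0)
    variance = preds.var(axis=0)  # population variance across members
    return mean, variance


# Toy predictions (hypothetical values): 3 sampled models, 3 molecules.
# All members agree on the third molecule, so its variance is zero even
# though nothing here guarantees that the prediction is correct.
toy_predictions = np.array([
    [6.1, 7.9, 5.0],
    [6.3, 8.1, 5.0],
    [5.9, 8.0, 5.0],
])
toy_mean, toy_variance = ensemble_mean_and_variance(toy_predictions)
```

The third column mirrors the failure mode described above: when every sampled model treats an out-of-domain molecule as in-domain, the members agree and the variance is small, which is why a distance-based signal can be complementary.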
Summary
Wang et al. J Cheminform (2021) 13:69
With the increasing scale of available datasets, deep learning methods have made a tremendous impact in the chemical domain [1]. ... model cannot give such information. This example shows that numerical results without a measure of veracity do not contain enough information for decision making [6]. Given the importance of uncertainty quantification, a plethora of methods have been proposed and employed in various cheminformatics tasks, such as molecular property prediction [7], chemical reaction prediction [8], material property prediction [9], NMR spectral property prediction [10] and interatomic potential prediction [11]. Current mainstream uncertainty quantification methods used in the chemical domain can be divided into two categories: distance-based approaches and Bayesian approaches. Although distance-based methods share a common goal, they differ in how they represent the distance between a molecule and the model training set. Many classical methods use a feature-space distance defined by molecular fingerprints [12,13,14,15,16,17], while some recent studies have shown that distance in latent space may yield superior performance [18, 19].
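A fingerprint-based feature-space distance of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited study: fingerprints are taken to be binary vectors, similarity is measured with the Tanimoto coefficient, and the distance to the training set is defined as one minus the mean similarity to the k most similar training molecules. The function names, the choice of k and the toy fingerprints are all hypothetical.

```python
import numpy as np


def tanimoto_similarity(query, train):
    """Tanimoto similarity between one binary fingerprint and each training row."""
    query = np.asarray(query, dtype=bool)
    train = np.asarray(train, dtype=bool)
    intersection = (train & query).sum(axis=1)
    union = (train | query).sum(axis=1)
    # Two all-zero fingerprints have an empty union; treat them as identical.
    return np.where(union > 0, intersection / np.maximum(union, 1), 1.0)


def distance_to_training_set(query, train, k=1):
    """One minus the mean similarity to the k nearest training fingerprints."""
    similarities = tanimoto_similarity(query, train)
    nearest = np.sort(similarities)[-k:]  # the k largest similarities
    return 1.0 - nearest.mean()


# Toy 4-bit fingerprints (hypothetical values).
train_fps = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
])
dist_seen = distance_to_training_set([1, 1, 0, 0], train_fps)    # identical to a training row
dist_novel = distance_to_training_set([0, 0, 0, 1], train_fps)   # shares no bits with training
dist_partial = distance_to_training_set([1, 1, 1, 0], train_fps) # overlaps both rows
```

A larger distance marks a molecule as further from the training set, which a distance-based approach would translate into higher uncertainty. In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written arrays.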