Using Random Forest To Model the Domain Applicability of Another Random Forest Model

Robert P Sheridan

doi:10.1021/ci400482e

Abstract

In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities. We will call this traditional type of QSAR model an "activity model". The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield of QSAR is domain applicability, which aims to estimate the reliability of the prediction that a specific activity model makes for a specific molecule. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics. We call this an "error model". A previous publication from our laboratory (Sheridan, J. Chem. Inf. Model. 2012, 52, 814-823) suggested that using three metrics simultaneously would be more discriminating than using any one metric. An error model could be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model from multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and seven metrics we examine here, it appears possible to construct a useful error model using only two metrics (TREE_SD and PREDICTED). Neither metric requires calculating similarities or distances between the molecules being predicted and the molecules used to build the activity model, a calculation that can be rate-limiting.
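The two-metric idea can be illustrated with a minimal sketch. It is not the paper's protocol. It assumes scikit-learn's `RandomForestRegressor` as a stand-in random forest, reads PREDICTED as the mean of the per-tree predictions and TREE_SD as their standard deviation, uses cross-validated absolute error as the target of the error model, and runs on synthetic descriptors. The helper names `tree_metrics` and `build_error_model` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def tree_metrics(forest, X):
    """Return (PREDICTED, TREE_SD): mean and std of the per-tree predictions."""
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)


def build_error_model(X, y, n_splits=5, n_trees=50, seed=0):
    """Fit a second forest mapping (TREE_SD, PREDICTED) to absolute error.

    Assumption: errors come from cross-validation, so every molecule is
    predicted by an activity model that did not see it during training.
    """
    predicted = np.empty(len(y))
    tree_sd = np.empty(len(y))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        activity = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        activity.fit(X[train_idx], y[train_idx])
        predicted[test_idx], tree_sd[test_idx] = tree_metrics(activity, X[test_idx])
    abs_error = np.abs(y - predicted)
    metrics = np.column_stack([tree_sd, predicted])
    error_model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    return error_model.fit(metrics, abs_error)


# Synthetic stand-in for descriptors and activities (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

error_model = build_error_model(X, y)

# Final activity model on all data, then estimated errors for new molecules.
activity_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
X_new = rng.normal(size=(5, 6))
pred_new, sd_new = tree_metrics(activity_model, X_new)
estimated_error = error_model.predict(np.column_stack([sd_new, pred_new]))
print(estimated_error)
```

Both inputs to the error model come directly from the activity forest, so no similarity or distance calculation against the training set is needed in this sketch.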

Journal: Journal of Chemical Information and Modeling
Publication Date: Nov 5, 2013
Citations: 92

