Abstract

The performance of a classification model is often assessed in terms of how well it separates a set of known observations into appropriate classes. If the validation sets used for such analyses are redundant due to bias in sampling, the relevance of the conclusions drawn to prospective work in which new kinds of positives are sought may be compromised. In the case of the various virtual screening techniques used in modern drug discovery, such bias generally appears as over‐representation of particular structural subclasses in the test set. We show how clustering by substructural similarity, followed by applying arithmetic and harmonic weighting schemes to receiver operating characteristic (ROC) curves, can be used to identify validation sets that are biased due to such redundancies. This can be accomplished qualitatively by direct examination or quantitatively by comparing the areas under the respective linear or semilog curves (AUCs or pAUCs). Copyright © 2009 John Wiley & Sons, Ltd.
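
The idea of comparing unweighted and cluster-weighted ROC curves can be illustrated with a minimal Python sketch. The abstract does not give the weighting formulas, so the definitions here are assumptions: under "arithmetic" weighting each active counts 1/(size of its cluster), so every cluster contributes equally, and under "harmonic" weighting the k-th active retrieved from a cluster counts 1/k. The function names `weighted_roc` and `auc` and the toy ranking are hypothetical, cluster labels are taken as given rather than computed from substructural similarity, and only the linear AUC is shown.

```python
from collections import defaultdict


def weighted_roc(ranked, scheme="uniform"):
    """Return ROC points (fpr, tpr) for a ranked list of compounds.

    ranked: list of (is_active, cluster_id) tuples, best score first;
            cluster_id is ignored for decoys.
    scheme: "uniform", "arithmetic" or "harmonic" (assumed definitions,
            see the lead-in; not taken verbatim from the paper).
    """
    sizes = defaultdict(int)  # number of actives per cluster
    for is_active, cid in ranked:
        if is_active:
            sizes[cid] += 1
    n_decoys = sum(1 for is_active, _ in ranked if not is_active)
    if not sizes or n_decoys == 0:
        raise ValueError("need at least one active and one decoy")

    seen = defaultdict(int)  # actives retrieved so far, per cluster
    weights = []
    for is_active, cid in ranked:
        if not is_active:
            weights.append(0.0)
            continue
        seen[cid] += 1
        if scheme == "uniform":
            weights.append(1.0)
        elif scheme == "arithmetic":
            weights.append(1.0 / sizes[cid])
        elif scheme == "harmonic":
            weights.append(1.0 / seen[cid])
        else:
            raise ValueError(f"unknown scheme: {scheme}")

    total = sum(weights)
    tp, fp = 0.0, 0
    points = [(0.0, 0.0)]
    for (is_active, _), w in zip(ranked, weights):
        if is_active:
            tp += w  # weighted true positives
        else:
            fp += 1  # decoys are never weighted
        points.append((fp / n_decoys, tp / total))
    return points


def auc(points):
    """Area under a ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy ranking: cluster "A" has three actives ranked at the top,
# cluster "B" has a single active found later; None marks decoys.
ranked = [(True, "A"), (True, "A"), (True, "A"),
          (False, None), (False, None),
          (True, "B"),
          (False, None), (False, None)]

for scheme in ("uniform", "arithmetic", "harmonic"):
    print(scheme, round(auc(weighted_roc(ranked, scheme)), 4))
```

In this toy case the unweighted AUC is the largest of the three because the over-represented cluster "A" is retrieved early. A noticeable gap between the unweighted and weighted areas is the kind of signal that, under these assumptions, would suggest the validation set is redundant.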
