Abstract
Predicting protein-ligand interactions with artificial intelligence (AI) models has attracted great interest in recent years. However, data-driven AI models unequivocally suffer from a lack of sufficiently large and unbiased datasets. Here, we systematically investigated the data biases in the PDBbind and DUD-E datasets. We examined the performance of the atomic convolutional neural network (ACNN) on the PDBbind core set and achieved a Pearson R² of 0.73 between experimental and predicted binding affinities. Strikingly, the ACNN models did not need to learn the essential protein-ligand interactions in the complex structures: they achieved similar performance even on datasets containing only ligand structures or only protein structures. In contrast, data splitting based on similarity clustering (protein sequence or ligand scaffold) significantly reduced model performance. We also identified property and topology biases in the DUD-E dataset that artificially increased the enrichment performance of virtual screening. The property bias in DUD-E was reduced by enforcing more stringent ligand property matching rules, whereas the topology bias persists because molecular fingerprint similarity is used as a decoy selection criterion. Therefore, we believe that sufficiently large and unbiased datasets are desirable for training robust AI models to accurately predict protein-ligand interactions.
Highlights
Structure-based virtual screening has been widely used to discover new ligands based on target structures (Kitchen et al, 2004; Shoichet, 2004; Irwin and Shoichet, 2016; Zhou et al, 2016; Wang et al, 2017; Lyu et al, 2019; Peng et al, 2019)
We evaluated the performance of the atomic convolutional neural network (ACNN) model in predicting protein-ligand binding affinities on the PDBbind datasets using different data splitting approaches
The former is represented by PDBbind, a collection of experimentally determined protein-ligand complex structures with known binding affinities, which is reliable but small and arguably suffers from data redundancy caused by protein and ligand similarity
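The similarity-clustered data splitting mentioned in the highlights can be illustrated with a minimal sketch. It assumes that each complex already carries a cluster label (for example from protein sequence or ligand scaffold clustering, which is not shown here). The function name `cluster_split` and its parameters are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no cluster spans both train and test.

    cluster_ids: one (sortable) cluster label per sample; how the labels
    were obtained is outside the scope of this sketch.
    """
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)

    clusters = sorted(groups)               # fixed order before shuffling
    random.Random(seed).shuffle(clusters)   # reproducible shuffle

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        # Whole clusters go to the test set until the target size is
        # reached, so the test set may slightly overshoot the target.
        if len(test) < target:
            test.extend(groups[cid])
        else:
            train.extend(groups[cid])
    return sorted(train), sorted(test)


labels = ["a", "a", "b", "c", "c", "c", "d", "e"]
train_idx, test_idx = cluster_split(labels, test_fraction=0.2, seed=0)
```

Because whole clusters are assigned to one side, similar proteins or ligands cannot appear in both the training and the test set, which is what separates this kind of split from a purely random one.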
Summary
Structure-based virtual screening (molecular docking) has been widely used to discover new ligands based on target structures (Kitchen et al, 2004; Shoichet, 2004; Irwin and Shoichet, 2016; Zhou et al, 2016; Wang et al, 2017; Lyu et al, 2019; Peng et al, 2019). At the heart of molecular docking is the scoring function, which estimates the binding affinities of protein-ligand complexes. The performance of virtual screening has been evaluated on several publicly available benchmarking datasets, including the Community Structure-Activity Resource (CSAR) (Dunbar et al, 2011), PDBbind (Liu et al, 2017), the Directory of Useful Decoys (DUD) (Huang et al, 2006b), and the Directory of Useful Decoys - Enhanced (DUD-E) (Mysinger et al, 2012). The CSAR and PDBbind datasets were compiled to facilitate the prediction of binding affinities based on experimental complex structures. The DUD and DUD-E datasets were originally designed to assess docking enrichment performance by distinguishing annotated actives from a large database of computationally generated non-binding decoy molecules.
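As an illustration of what docking enrichment measures, the sketch below computes an enrichment factor, one common enrichment metric, from a ranked list of actives and decoys. It is not necessarily the metric used in this work. The function name `enrichment_factor` is hypothetical, and the sketch assumes that higher scores mean better predicted binding (docking energies, where lower is better, would need their sign flipped first).

```python
def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Enrichment factor of actives in the top-ranked fraction.

    scores: predicted score per molecule (assumed: higher is better).
    is_active: 1 for an annotated active, 0 for a decoy.
    """
    if len(scores) != len(is_active):
        raise ValueError("scores and is_active must have equal length")
    n_total = len(scores)
    n_actives = sum(is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one active")

    # Number of molecules in the top-ranked subset (at least one).
    n_top = max(1, int(round(top_fraction * n_total)))

    # Rank molecules from best to worst score.
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    # Ratio of the active rate in the top subset to the overall active rate.
    return (hits_top / n_top) / (n_actives / n_total)


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef = enrichment_factor(scores, labels, top_fraction=0.2)
```

A value above 1 means that actives are ranked ahead of decoys more often than random selection would achieve. If decoys can be separated from actives by simple properties or topology alone, this number can be inflated without the model learning any protein-ligand interactions.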