Practical Model Selection for Prospective Virtual Screening.

Shengchao Liu, Spencer S. Ericksen, James L. Keck, Scott A. Wildman, Moayad Alnammi, F. Michael Hoffmann, Gene E. Ananiev, Anthony Gitter, Andrew F. Voter

doi:10.1021/acs.jcim.8b00363

Abstract

Virtual (computational) high-throughput screening provides a strategy for prioritizing compounds for experimental screens, but the choice of virtual screening algorithm depends on the data set and evaluation strategy. We consider a wide range of ligand-based machine learning and docking-based approaches for virtual screening on two protein–protein interactions, PriA-SSB and RMI-FANCM, and present a strategy for choosing which algorithm is best for prospective compound prioritization. Our workflow identifies a random forest as the best algorithm for these targets over more sophisticated neural network-based models. The top 250 predictions from our selected random forest recover 37 of the 54 active compounds from a library of 22,434 new molecules assayed on PriA-SSB. We show that virtual screening methods that perform well on public data sets and synthetic benchmarks, like multi-task neural networks, may not always translate to prospective screening performance on a specific assay of interest.
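The prospective result quoted above can be put in perspective with a little arithmetic. The sketch below is illustrative only: it computes recall, hit rate, and an enrichment factor (a common virtual screening metric, assumed here and not named in the abstract) from the four numbers the abstract reports. The function name `top_k_summary` and its parameter names are hypothetical.

```python
def top_k_summary(n_library, n_actives, k, n_actives_in_top_k):
    """Summarize a top-k virtual screening result.

    All metric definitions here are the conventional ones and are an
    assumption of this sketch, not taken from the abstract itself.
    """
    hit_rate = n_actives_in_top_k / k      # fraction of the top k that are active
    base_rate = n_actives / n_library      # fraction of the whole library that is active
    return {
        "recall": n_actives_in_top_k / n_actives,  # fraction of all actives recovered
        "hit_rate": hit_rate,
        "base_rate": base_rate,
        "enrichment_factor": hit_rate / base_rate,  # improvement over random selection
    }


# Numbers reported in the abstract for the PriA-SSB prospective screen.
summary = top_k_summary(
    n_library=22434, n_actives=54, k=250, n_actives_in_top_k=37
)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions, the top 250 predictions recover about 69% of the actives while testing roughly 1% of the library, which corresponds to an enrichment of about 61-fold over random selection.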

Highlights

After a specific protein or mechanistic pathway is identified to play an essential role in a disease process, the search begins for a chemical or biological ligand that can perturb the action or abundance of the disease target in order to mitigate the disease phenotype.
Drug discovery is time consuming and expensive.
We critically evaluated a collection of virtual screening (VS) algorithms that include both structure-based and ligand-based methods, with a focus on the subset of quantitative structure-activity relationship ligand-based methods that use machine learning to predict active compounds for a target based on initial screening data.

Summary

Introduction

After a specific protein or mechanistic pathway is identified to play an essential role in a disease process, the search begins for a chemical or biological ligand that can perturb the action or abundance of the disease target in order to mitigate the disease phenotype. A standard approach to discovering a chemical ligand is to screen thousands to millions of candidate compounds against the target in biochemical or cell-based assays, a process called high-throughput screening (HTS) that produces vast sets of valuable pharmacological data. Even though HTS assays are highly automated, screens of thousands of compounds sample only a small fraction of the millions of commercially available drug-like compounds. Cost and time preclude academic laboratories and even pharmaceutical companies from blindly testing the full set of drug-like compounds in HTS assays. There is a crucial need for an effective virtual screening (VS) process as a preliminary step in prioritizing compounds for HTS assays.

Objectives

Methods

Results

Conclusion