Machine learning is rapidly advancing drug discovery, significantly improving its speed and efficiency. Innovation in computer-aided drug design is driven primarily by structure- and ligand-based approaches. When only a few inhibitors are known for a target, data augmentation strategies are often used to improve model performance. In this study, we developed predictive machine learning models for structure-based drug discovery, using multiple traditional machine learning algorithms trained on datasets that account for target and ligand dynamics. To illustrate our approach, we present a composite model that combines classification and regression to predict YTHDF1 inhibitors using protein-ligand extended connectivity (PLEC) features. YTHDF1, a key m6A reader protein involved in mRNA translation, is implicated in various cancers, making it a promising therapeutic target. Traditional structure-based virtual screening (SBVS) with generic scoring functions has struggled to identify potent YTHDF1 inhibitors because of the protein's unique binding characteristics. To overcome this, we developed YTHDF1-specific machine learning scoring functions (MLSFs) to improve SBVS efficacy. We employed several data augmentation techniques to generate a comprehensive dataset that incorporates multiple conformations of the ligands and of the YTHDF1 protein. We trained 64 YTHDF1-specific MLSFs using four machine learning algorithms and evaluated them on ten test sets, focusing on their predictive and ranking power. Our results show that the artificial neural network with PLEC fingerprints (ANN-PLEC) outperforms the other MLSFs, consistently achieving a high area under the precision-recall curve (PR-AUC) of 0.87. This method shows promise for targets with few known active molecules, providing a viable path forward for drug discovery research.
The ANN-PLEC scoring function is freely available on GitHub at https://github.com/JuniML/SBVS-YTHDF1/.
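
For readers unfamiliar with the PR-AUC metric reported above, the following is a minimal sketch of one common way to estimate it, namely average precision over a ranked list of compounds. It is not taken from the study's code: the function name `average_precision`, the toy labels, and the toy scores are all hypothetical, and the study may have computed PR-AUC differently (for example by trapezoidal integration of the curve).

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: a step-wise estimate of the area under the PR curve.

    labels: 1 for an active compound, 0 for an inactive or decoy.
    scores: predicted scores, higher meaning "more likely active".
    Tied scores are not treated specially (they keep their input order).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active (label 1) is required")

    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each active
    return total / n_actives


# Toy data (hypothetical, for illustration only).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"PR-AUC (average precision): {toy_ap:.3f}")
```

Unlike ROC-AUC, the expected PR-AUC of a random ranking is roughly the fraction of actives in the test set, so the metric is informative for virtual screening libraries in which actives are rare.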