Machine learning (ML) models now play a crucial role in predicting properties essential to drug development, such as a drug's logarithmic acid dissociation constant (pKa). Despite recent architectural advances, these models often generalize poorly to novel compounds because ground-truth data are scarce, and they lack interpretability. With deliberately chosen molecular embeddings, however, atomic-resolution information can be recovered from chemical structures by observing how a model responds to atomic perturbations of an input molecule. Here, we present BCL-XpKa, a deep neural network (DNN)-based multitask classifier for pKa prediction that encodes local atomic environments through Mol2D descriptors. For each molecule, BCL-XpKa outputs a discrete distribution that captures both the pKa prediction and the model's uncertainty in that prediction. BCL-XpKa generalizes well to novel small molecules: it performs competitively with modern ML pKa predictors, outperforms several models in generalization tasks, and accurately models the effects of common molecular modifications on a molecule's ionizability. We then leverage BCL-XpKa's granular descriptor set and distribution-centered output through atomic sensitivity analysis (ASA), which decomposes a molecule's predicted pKa value into its respective atomic contributions without model retraining. ASA reveals that BCL-XpKa has implicitly learned high-resolution information about molecular substructures. We further demonstrate ASA's utility in structure preparation for protein-ligand docking by identifying ionization sites in 93.2% and 87.8% of complex small-molecule acids and bases, respectively. Finally, we apply ASA with BCL-XpKa to identify and optimize the physicochemical liabilities of a recently published KRAS-degrading PROTAC.
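The two ideas in the abstract, reading a prediction and an uncertainty from a discrete output distribution and scoring atoms by perturbation, can be illustrated with a minimal Python sketch. This is not the BCL-XpKa implementation. The abstract does not specify the bin layout, the uncertainty measure, or the perturbation scheme, so the sketch assumes unit-width pKa bins, takes the expected value and standard deviation as prediction and uncertainty, and perturbs an atom by zeroing its feature vector. All function names, the toy model, and the numbers are hypothetical.

```python
from math import sqrt


def summarize_distribution(probs, bin_centers):
    """Return (expected pKa, standard deviation) of a discrete distribution.

    Assumption: the prediction is the expectation over bin centers and the
    uncertainty is the standard deviation; the paper may define them differently.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    p = [x / total for x in probs]  # renormalize defensively
    mean = sum(pi * c for pi, c in zip(p, bin_centers))
    var = sum(pi * (c - mean) ** 2 for pi, c in zip(p, bin_centers))
    return mean, sqrt(var)


def atomic_sensitivity(predict, atom_features):
    """Score each atom by how much the prediction changes when it is perturbed.

    Assumption: an atom is "perturbed" by zeroing its feature vector. The model
    is only queried, never retrained.
    """
    baseline = predict(atom_features)
    scores = []
    for i in range(len(atom_features)):
        perturbed = [
            [0.0] * len(f) if j == i else f
            for j, f in enumerate(atom_features)
        ]
        scores.append(baseline - predict(perturbed))
    return scores


def toy_predict(atom_features):
    """Hypothetical stand-in model: a base pKa of 7 shifted by per-atom terms."""
    return 7.0 + sum(f[0] for f in atom_features)


# Hypothetical three-bin output centered on pKa 3, 4, and 5.
pka, uncertainty = summarize_distribution([0.1, 0.8, 0.1], [3.0, 4.0, 5.0])

# Hypothetical three-atom molecule with one feature per atom.
scores = atomic_sensitivity(toy_predict, [[-2.5], [0.0], [0.5]])
```

In this toy setting the atom with the largest absolute score is the one whose removal shifts the predicted pKa the most, which mirrors how a perturbation-based analysis would flag a likely ionization site.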