Abstract
The pKa of C-H acids is an important parameter in organic synthesis, drug discovery, and materials science. However, predicting pKa remains a major challenge because experimental data are scarce and chemical insight is lacking. Here, a new model for predicting the pKa values of C-H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract topological and target-related information from the molecular graph data, and a readout layer was used to retrieve the information on the C atom at the ionization site. A fully connected network then used the retrieved information to predict pKa. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established in which the H atoms in a molecule are replaced with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining how the predicted pKa value changes when a specific atom is masked, and this analysis was used to identify the substituents that matter most for pKa. The model was evaluated on two data sets from the iBonD database. Dataset1 includes the experimental pKa values of C-H acids measured in DMSO, while dataset2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.
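The hydrogen-replacement augmentation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Mol` container, the `SUBSTITUENTS` library, and the `replace_h` and `augment` helpers are hypothetical names introduced here, the three substituents are arbitrary examples, and sparing the acidic hydrogen (`keep_h`) is an assumption made so that every variant remains a C-H acid at the same site. The abstract does not say how the augmented molecules are labelled for pretraining, so the sketch generates structures only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mol:
    """Toy molecular graph (hypothetical container, not from the paper)."""
    atoms: tuple      # element symbols, e.g. ("C", "H", "H", "H", "H")
    bonds: frozenset  # undirected bonds, each stored as frozenset({i, j})


# Assumed substituent library: label -> (atoms, internal bonds).
# Atom 0 of each substituent is the one that attaches to the parent molecule.
SUBSTITUENTS = {
    "F": (("F",), ()),                                        # electron-withdrawing
    "CN": (("C", "N"), ((0, 1),)),                            # electron-withdrawing
    "CH3": (("C", "H", "H", "H"), ((0, 1), (0, 2), (0, 3))),  # weakly donating
}


def replace_h(mol, h_index, sub_atoms, sub_bonds):
    """Return a new Mol in which hydrogen `h_index` is replaced by a substituent."""
    if mol.atoms[h_index] != "H":
        raise ValueError("atom %d is not a hydrogen" % h_index)
    atoms = list(mol.atoms)
    # The attachment atom takes over the hydrogen's slot, so it inherits its bond.
    atoms[h_index] = sub_atoms[0]
    offset = len(atoms)
    atoms.extend(sub_atoms[1:])

    def to_global(k):
        # Map a substituent-local index to an index in the new molecule.
        return h_index if k == 0 else offset + k - 1

    bonds = set(mol.bonds)
    for a, b in sub_bonds:
        bonds.add(frozenset((to_global(a), to_global(b))))
    return Mol(tuple(atoms), frozenset(bonds))


def augment(mol, keep_h):
    """Yield (label, Mol) for every single-H substitution except the acidic H."""
    for i, symbol in enumerate(mol.atoms):
        if symbol != "H" or i == keep_h:
            continue
        for label, (sub_atoms, sub_bonds) in SUBSTITUENTS.items():
            yield label, replace_h(mol, i, sub_atoms, sub_bonds)


# Usage: methane with atom 1 treated as the acidic hydrogen.
methane = Mol(
    atoms=("C", "H", "H", "H", "H"),
    bonds=frozenset(frozenset((0, i)) for i in range(1, 5)),
)
variants = list(augment(methane, keep_h=1))
print(len(variants), "augmented structures")
```

Each variant differs from the parent by one substituent, so a pretraining set built this way covers a range of electronic environments around the same ionization site. A real pipeline would more likely use a cheminformatics toolkit to perform the substitution and check valences.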