The acid dissociation constant, usually reported as its negative logarithm (pKa), quantifies the propensity of a solute to donate a proton to its solvent and is crucial for drug design and synthesis, environmental fate studies, chemical manufacturing, and many other fields. Unfortunately, the terminology used to describe acid-base phenomena is sometimes inconsistent, which creates considerable potential for misinterpretation. In this work, we examine a systematic confusion underlying the definition of "acidic" and "basic" pKa values for zwitterionic compounds. Because of this confusion, some pKa data are misrepresented in data repositories, including the widely used and highly trusted ChEMBL database. Such datasets are frequently used as training data for pKa prediction models; hence, confusion and errors in the data degrade model performance. Herein, we discuss the intricacies of this issue and make suggestions for describing acid-base phenomena, training pKa prediction models, and stewarding pKa datasets, given the high potential for confusion and the potentially high impact on downstream applications.