The range of chemical databases available has increased dramatically in recent years, but the reliability and quality of their data are often compromised by human error. The size of these databases makes manual curation and checking time-consuming; thus, automated tools to assist this process are highly desirable. Herein, we propose the use of Graph Neural Networks (GNNs) to identify potential stereochemical misassignments in the primary asymmetric catalysis literature. Our method relies on an ensemble of GNN models to predict the expected stereoselectivity of exemplars for a particular asymmetric reaction. When the majority of these models disagree with the reported outcome, the data point is flagged as a possible stereochemical misassignment. The flagged cases are few in number and can therefore be investigated individually to establish their cause. We demonstrate the use of this approach to spot potential literature stereochemical misassignments in the ketone products resulting from catalytic asymmetric 1,4-addition of organoboron nucleophiles to Michael acceptors in two different databases, each using a different family of chiral ligands (bisphosphine and diene ligands). Our results demonstrate that this methodology is useful for the curation of medium-sized databases and is significantly faster than fully manual curation and checking. In the datasets investigated, human expert checking was reduced to 2.2% and 3.5% of the total data exemplars.
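
The majority-vote criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the "R"/"S" label encoding, and the strict-majority threshold (ties are not flagged) are assumptions made for illustration, and the GNN models themselves are replaced by precomputed predictions.

```python
from typing import List, Sequence


def flag_possible_misassignments(
    ensemble_predictions: Sequence[Sequence[str]],
    reported: Sequence[str],
) -> List[int]:
    """Return indices of exemplars where most models disagree with the report.

    ensemble_predictions[m][i] is the label predicted by model m for
    exemplar i (hypothetical encoding: "R" or "S").
    reported[i] is the stereochemical outcome reported in the literature.
    """
    n_models = len(ensemble_predictions)
    flagged = []
    for i, reported_label in enumerate(reported):
        # Count the models whose prediction differs from the reported label.
        disagreements = sum(
            1 for model_preds in ensemble_predictions
            if model_preds[i] != reported_label
        )
        # Assumption: flag only on a strict majority; ties are left unflagged.
        if disagreements > n_models / 2:
            flagged.append(i)
    return flagged


# Toy data: three hypothetical models, three exemplars.
predictions = [
    ["R", "S", "R"],
    ["R", "S", "S"],
    ["R", "R", "S"],
]
reported_labels = ["R", "S", "R"]

flagged = flag_possible_misassignments(predictions, reported_labels)
print(flagged)  # only the third exemplar (index 2) is flagged
```

Only the flagged indices would then be passed to a human expert, which is how the checking burden shrinks to a small fraction of the dataset.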