Abstract
Heuristic and machine learning models for rank-ordering reaction templates comprise an important basis for computer-aided organic synthesis regarding both product prediction and retrosynthetic pathway planning. Their viability relies heavily on the quality and characteristics of the underlying template database. With the advent of automated reaction and template extraction software and consequently the creation of template databases too large for manual curation, a data-driven approach to assess and improve the quality of template sets is needed. We therefore systematically studied the influence of template generality, canonicalization, and exclusivity on the performance of different template ranking models. We find that duplicate and nonexclusive templates, i.e., templates which describe the same chemical transformation on identical or overlapping sets of molecules, decrease both the accuracy of the ranking algorithm and the applicability of the respective top-ranked templates significantly. To remedy the negative effects of nonexclusivity, we developed a general and computationally efficient framework to deduplicate and hierarchically correct templates. As a result, performance improved considerably for both heuristic and machine learning template ranking models, as well as multistep retrosynthetic planning models. The canonicalization and correction code is made freely available.
Highlights
Retrosynthesis, i.e., the proposal of precursors for a desired product, and forward reaction prediction, i.e., the proposal of possible products given a set of reactants, are central topics of organic chemistry
To filter out nonexclusive templates, our novel hierarchical correction scheme was utilized to arrive at exclusive template sets
Since the accuracy of a machine learning template recommendation scheme usually suffers from a large number of templates, it is desirable to keep the number of templates as low as possible, without sacrificing chemical plausibility of the recommended reactions
Summary
Retrosynthesis, i.e., the proposal of precursors for a desired product, and forward reaction prediction, i.e., the proposal of possible products given a set of reactants, are central topics of organic chemistry. More general templates are applicable to more molecules and decrease the overall number of classes, potentially increasing model performance. However, they may lead to a large number of proposed precursors, some of which may not be chemically meaningful. Data-driven approaches to retrosynthesis usually rely on the automated extraction of reaction templates from reaction databases, for example, via the open-source package RDChiral.[13] Such template sets are, by nature, not as well curated and validated as manually crafted reaction rules. They can contain duplicate and nonexclusive templates, and individual templates may be too large or too small. This necessitates the development of efficient and scalable canonicalization and correction routines.
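To make the notions of duplicate and nonexclusive templates concrete, the following is a minimal Python sketch and not the canonicalization or correction code released with the paper. It assumes that each template is a plain placeholder string of the form `reactants>>products`, and that we already know which reactions (integer IDs) each template reproduces. The names `toy_canonical_key`, `deduplicate`, and `drop_subsumed` are hypothetical, and the fragment-sorting key is only a stand-in for a real SMARTS canonicalization, which would also have to handle atom ordering and atom-map numbers.

```python
from collections import defaultdict


def toy_canonical_key(template: str) -> str:
    """Hypothetical stand-in for canonicalization: sort the dot-separated
    fragments on each side of '>>' so that fragment order does not matter."""
    left, right = template.split(">>")

    def norm(side: str) -> str:
        return ".".join(sorted(side.split(".")))

    return f"{norm(left)}>>{norm(right)}"


def deduplicate(coverage: dict) -> dict:
    """Merge templates that share a canonical key and pool their reaction IDs."""
    merged = defaultdict(set)
    for template, reactions in coverage.items():
        merged[toy_canonical_key(template)] |= set(reactions)
    return dict(merged)


def drop_subsumed(coverage: dict) -> dict:
    """Toy exclusivity filter: keep a template only if its reaction set is not
    fully contained in the set of an already kept, broader template."""
    kept = {}
    # Visit the broadest templates first; break ties by key for determinism.
    ordered = sorted(coverage.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for template, reactions in ordered:
        if not any(reactions <= other for other in kept.values()):
            kept[template] = reactions
    return kept


# Placeholder templates (not real SMARTS) and the reactions they reproduce.
raw = {
    "A.B>>C": {1, 2},
    "B.A>>C": {2, 3},   # duplicate of the first, fragments reordered
    "A'.B>>C": {1},     # more specific variant, subsumed by "A.B>>C"
    "D>>E": {4},
}

unique = deduplicate(raw)
exclusive = drop_subsumed(unique)
print(exclusive)
```

In this toy setting, the two reordered templates collapse into one class, and the more specific variant is dropped because the broader template already accounts for its only reaction. A real workflow would decide subsumption by applying the templates to molecules and comparing outcomes rather than by comparing precomputed ID sets.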