Influence of Template Size, Canonicalization, and Exclusivity for Retrosynthesis and Reaction Prediction Applications.

Esther Heid, Andrea Aude, William H Green, Jiannan Liu

doi:10.1021/acs.jcim.1c01192

Open Access

https://doi.org/10.1021/acs.jcim.1c01192

Abstract

Heuristic and machine learning models for rank-ordering reaction templates comprise an important basis for computer-aided organic synthesis regarding both product prediction and retrosynthetic pathway planning. Their viability relies heavily on the quality and characteristics of the underlying template database. With the advent of automated reaction and template extraction software and consequently the creation of template databases too large for manual curation, a data-driven approach to assess and improve the quality of template sets is needed. We therefore systematically studied the influence of template generality, canonicalization, and exclusivity on the performance of different template ranking models. We find that duplicate and nonexclusive templates, i.e., templates which describe the same chemical transformation on identical or overlapping sets of molecules, decrease both the accuracy of the ranking algorithm and the applicability of the respective top-ranked templates significantly. To remedy the negative effects of nonexclusivity, we developed a general and computationally efficient framework to deduplicate and hierarchically correct templates. As a result, performance improved considerably for both heuristic and machine learning template ranking models, as well as multistep retrosynthetic planning models. The canonicalization and correction code is made freely available.
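
To make the notions of duplicate and nonexclusive templates concrete, the following is a minimal sketch in pure Python. It is not the authors' canonicalization and correction code. It assumes a hypothetical mapping from template names to the sets of molecules they apply to, and it assumes all templates in the toy data describe the same chemical transformation. The names `applicability`, `find_duplicates`, and `find_nonexclusive_pairs` are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data (an assumption, not from the paper):
# template name -> set of molecules the template can be applied to.
# All templates are assumed to encode the same chemical transformation.
applicability = {
    "T1": frozenset({"mol_a", "mol_b"}),
    "T2": frozenset({"mol_a", "mol_b"}),  # same set as T1: a duplicate
    "T3": frozenset({"mol_b", "mol_c"}),  # shares mol_b with T1 and T2
    "T4": frozenset({"mol_d"}),           # shares nothing: exclusive
}


def find_duplicates(applicability):
    """Group templates that apply to identical sets of molecules."""
    groups = defaultdict(list)
    for name, molecules in applicability.items():
        groups[molecules].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def find_nonexclusive_pairs(applicability):
    """Return template pairs whose molecule sets overlap but are not identical."""
    pairs = []
    for a, b in combinations(sorted(applicability), 2):
        set_a, set_b = applicability[a], applicability[b]
        if set_a != set_b and set_a & set_b:
            pairs.append((a, b))
    return pairs


duplicates = find_duplicates(applicability)
nonexclusive = find_nonexclusive_pairs(applicability)
print("duplicates:", duplicates)
print("nonexclusive pairs:", nonexclusive)
```

In this toy setting, a ranking model trained on such labels would see the same transformation under several competing class labels, which is the situation the abstract identifies as harmful to ranking accuracy. Real template sets would need substructure matching (for example with a cheminformatics toolkit) rather than precomputed sets.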

Highlights

Retrosynthesis, i.e., the proposal of precursors for a desired product, and forward reaction prediction, i.e., the proposal of possible products given a set of reactants, are central topics of organic chemistry.
To filter out nonexclusive templates, our novel hierarchical correction scheme was utilized to arrive at exclusive template sets.
Since the accuracy of a machine learning template recommendation scheme usually suffers from a large number of templates, it is desirable to keep the number of templates as low as possible, without sacrificing chemical plausibility of the recommended reactions.

Summary

Introduction

Retrosynthesis, i.e., the proposal of precursors for a desired product, and forward reaction prediction, i.e., the proposal of possible products given a set of reactants, are central topics of organic chemistry. More general templates are applicable to more molecules and decrease the overall number of classes, potentially increasing model performance. However, they may also lead to a large number of proposed precursors, some of which may not be chemically meaningful. Data-driven approaches to retrosynthesis usually rely on the automated extraction of reaction templates from reaction databases, for example, via the open-source package RDChiral.[13] Such template sets are, by nature, not as well curated and validated as manually crafted reaction rules. They can contain duplicate and nonexclusive templates and may suffer from too large or too small template sizes. This necessitates the development of efficient and scalable canonicalization and correction routines.

Journal: Journal of Chemical Information and Modeling
Publication Date: Dec 23, 2021
Citations: 12
License type: CC BY 4.0
