ORDerly: Data Sets and Benchmarks for Chemical Reaction Data.

Daniel S Wigh, Joe Arrowsmith, Alexei A Lapkin, Alexander Pomberger, Kobi C Felton

doi:10.1021/acs.jcim.4c00292

Open Access

https://doi.org/10.1021/acs.jcim.4c00292

Abstract

Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
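As an illustration of how a skipped cleaning step can inflate a metric, the sketch below shows one generic mechanism: duplicate reaction records leaking across a train/test split. This is an assumption made for illustration only. It does not use the ORDerly API, it is not necessarily the mechanism studied in the paper, and the record schema (`reactants`, `product`, `solvent`), the helper names, and the toy SMILES data are all hypothetical.

```python
import random

# Toy reaction records (hypothetical schema, not the ORD schema).
# The second record is the first with its reactants listed in a different order.
raw_reactions = [
    {"reactants": ["CC(=O)O", "CO"], "product": "CC(=O)OC", "solvent": "C1CCOC1"},
    {"reactants": ["CO", "CC(=O)O"], "product": "CC(=O)OC", "solvent": "C1CCOC1"},
    {"reactants": ["CC(=O)O", "CCO"], "product": "CC(=O)OCC", "solvent": "C1CCOC1"},
    {"reactants": ["CC(=O)Cl", "CN"], "product": "CC(=O)NC", "solvent": "ClCCl"},
] * 3  # repeat to mimic the same reaction appearing in several records


def reaction_key(rxn):
    """Order-independent identity of a reaction record."""
    return (tuple(sorted(rxn["reactants"])), rxn["product"], rxn["solvent"])


def deduplicate(reactions):
    """Keep only the first record for each reaction key."""
    seen = set()
    unique = []
    for rxn in reactions:
        key = reaction_key(rxn)
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique


def train_test_split(reactions, test_fraction=0.5, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_fraction(train, test):
    """Fraction of test records whose reaction key also occurs in train."""
    if not test:
        return 0.0
    train_keys = {reaction_key(rxn) for rxn in train}
    return sum(reaction_key(rxn) in train_keys for rxn in test) / len(test)


if __name__ == "__main__":
    # Without deduplication, test reactions may already be present in train.
    train, test = train_test_split(raw_reactions)
    print("leak without dedup:", leaked_fraction(train, test))

    # After deduplication, every key is unique, so the split cannot leak.
    train, test = train_test_split(deduplicate(raw_reactions))
    print("leak with dedup:", leaked_fraction(train, test))
```

A model evaluated on a split with a non-zero leaked fraction is partly scored on reactions it has already seen, so the reported metric can overstate how well it generalizes. In practice the key would be built from canonicalized SMILES rather than raw strings, since the same molecule can be written in several ways.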

Journal: Journal of Chemical Information and Modeling
Publication Date: Apr 22, 2024
Citations: 4
License type: CC BY 4.0
