Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets.

Alessandra Toniato, Marzena Maria Lehmann, Alain C Vaucher, Teodoro Laino, Torsten Luksch, Philippe Schwaller, Marco Stenta

doi:10.1021/acs.chemmater.3c01406

Open Access

https://doi.org/10.1021/acs.chemmater.3c01406

Abstract

The world is on the verge of a new industrial revolution, and language models are poised to play a pivotal role in this transformative era. Their ability to offer intelligent insights and forecasts has made them a valuable asset for businesses seeking a competitive advantage. The chemical industry, in particular, can benefit significantly from harnessing their power. Language models have been applied to tasks such as predicting reaction outcomes or retrosynthetic routes since as early as 2016. While such models have demonstrated impressive abilities, the lack of publicly available data sets with universal coverage is often the limiting factor for achieving even higher accuracies. This makes it imperative for organizations to incorporate proprietary data sets into their model training processes to improve model performance. So far, however, these data sets frequently remain untapped, as there are no established criteria for model customization. In this work, we report a successful methodology for retraining language models on reaction outcome prediction and single-step retrosynthesis tasks, using proprietary, nonpublic data sets. We report a considerable boost in accuracy by combining patent and proprietary data in a multidomain learning formulation. This exercise, inspired by a real-world use case, enables us to formulate guidelines that can be adopted in different corporate settings to customize chemical language models easily.

Journal: Chemistry of Materials: a publication of the American Chemical Society
Publication Date: Oct 27, 2023
Citations: 1
License type: CC BY-NC-ND 4.0
