DECIMER 1.0: deep learning for chemical image recognition using transformers

Kohulan Rajan, Achim Zielesny, Christoph Steinbeck

doi:10.1186/s13321-021-00538-8


Open Access

https://doi.org/10.1186/s13321-021-00538-8


Abstract

The amount of data available on chemical structures and their properties has increased steadily over the past decades. However, articles published before the mid-1990s are available only in printed or scanned form. Extracting the data from those articles and storing it in a publicly accessible database is desirable, but doing so manually is slow and error-prone. Optical Chemical Structure Recognition (OCSR) tools were developed to extract chemical structure depictions and convert them into a computer-readable format; the best-performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods and to provide an automated open-source software solution. Various current deep learning approaches were explored to find the best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of predicting SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. This article presents the new DECIMER model, a transformer-based network that predicts SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and with above 89% accuracy from depictions with stereochemical information.
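To make the accuracy figures above concrete, here is a minimal sketch of one common way such a number can be computed: the fraction of predicted SMILES strings that match the reference string exactly. It is an illustration only, not the evaluation code of the paper. The function name `exact_match_accuracy` and the toy molecules are hypothetical, and the sketch assumes both lists have already been canonicalized by the same toolkit, because one molecule can be written as several different SMILES strings.

```python
def exact_match_accuracy(predicted, reference):
    """Return the fraction of predictions identical to their reference string.

    Assumption: both lists hold SMILES canonicalized by the same toolkit,
    so that string equality is a fair proxy for structural identity.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty set")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical toy data: ethanol, benzene, acetic acid.
reference = ["CCO", "c1ccccc1", "CC(=O)O"]
# The last prediction encodes the same molecule (acetic acid) but is a
# different string, so a plain string comparison counts it as a miss.
predicted = ["CCO", "c1ccccc1", "CC(O)=O"]

accuracy = exact_match_accuracy(predicted, reference)
print(f"exact-match accuracy: {accuracy:.2%}")
```

The third pair shows why canonicalization matters before comparing strings: without it, a structurally correct prediction can be scored as wrong.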

Highlights

Scientists build on the results of their peers
A model was trained on a dataset of 1 million molecules for 50 epochs on an Nvidia Tesla V100 Graphics Processing Unit (GPU), and the same model was trained on Tensor Processing Units (TPUs), a TPU V3-8 and a TPU V3-32
Training on a TPU V3-8 was up to 4 times faster than on a V100 GPU, and training on a TPU V3-32 was 16 times faster, see Fig. 4
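The speedup factors in the highlights translate directly into training time. The short sketch below works through that arithmetic. Only the factors 4 and 16 come from the text; the baseline of 100 hours is a made-up placeholder, not a figure from the paper, and the factor for the TPU V3-8 is quoted as an upper bound ("up to"), so its estimate is a best case.

```python
# Speedup relative to one V100 GPU, as quoted in the highlights.
# The TPU V3-8 value is an upper bound ("up to 4 times").
SPEEDUP_VS_V100 = {
    "V100 GPU": 1.0,
    "TPU V3-8": 4.0,
    "TPU V3-32": 16.0,
}


def estimated_hours(baseline_hours, speedup):
    """Scale a baseline training time by a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


# Hypothetical baseline: assume the V100 run takes 100 hours.
baseline_hours = 100.0
estimates = {
    device: estimated_hours(baseline_hours, factor)
    for device, factor in SPEEDUP_VS_V100.items()
}

for device, hours in estimates.items():
    print(f"{device}: {hours:.2f} h")
```

Under that assumed baseline, the same run would take 25 hours on the TPU V3-8 at best and 6.25 hours on the TPU V3-32.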

Summary

Introduction

Scientists build on the results of their peers. Knowledge and data arising from previous research are shared through scientific publications and, increasingly, through the deposition of data in repositories. Most chemical data is published as text and images in scientific publications [2]. Most of this published data is not machine-readable, and manual curation is still the standard. This manual work is tedious and error-prone [4].

J Chem Inf Comput Sci 28:31–36
J Chem Inf Model 49:740–743
Peryea T, Katzel D, Zhao T, Southall N, Nguyen D-T (2019) MOLVEC: Open source library for chemical structure recognition. In: Abstracts of papers of the American Chemical Society, vol 258

Methods

Results

Conclusion


Journal: Journal of Cheminformatics
Publication Date: Aug 17, 2021
Citations: 41
License type: open-access


Similar Papers

DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications
Kohulan Rajan ... Christoph Steinbeck
Nature Communications | VOL. 14
19 Aug 2023

DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature
Kohulan Rajan ... Achim Zielesny
Journal of Cheminformatics | VOL. 13
08 Mar 2021

Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture
Kohulan Rajan ... Christoph Steinbeck
Journal of Cheminformatics | VOL. 16
05 Jul 2024

MPOCSR: optical chemical structure recognition based on multi-path Vision Transformer
Fan Lin ... Jianhua Li
Complex & Intelligent Systems
22 Jul 2024
