Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Umit V Ucak, Islambek Ashyrmamatov, Juyong Lee

doi:10.1186/s13321-023-00725-9

Open Access

https://doi.org/10.1186/s13321-023-00725-9

Journal: Journal of Cheminformatics
Publication Date: May 29, 2023
Citations: 13
License type: open-access

Affiliation: Seoul National University

Abstract

Tokenization is an important preprocessing step in natural language processing that can significantly influence prediction quality. This research shows that traditional SMILES tokenization has a limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates the ambiguities arising from the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed its adverse effects on prediction quality. Additionally, we quantified token-level repetitions and included generated examples for qualitative examination. We believe that atom-in-SMILES tokenization has great potential for adoption across related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
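To make the limitation of generic SMILES tokens concrete, the following is a minimal sketch of conventional atom-wise SMILES tokenization. The regular expression is a pattern widely used for chemical language models; it is an assumption that it resembles the baseline tokenizer compared in the paper, and the function name `tokenize_smiles` is hypothetical. The sketch does not implement atom-in-SMILES itself; it only shows how chemically different atoms collapse onto the same token.

```python
import re
from collections import Counter

# Atom-wise SMILES pattern commonly used in the chemical language model
# literature. Assumption: not necessarily the exact baseline used in the paper.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-wise tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Acetic acid: a methyl carbon and a carboxyl carbon, plus a carbonyl
    # oxygen and a hydroxyl oxygen.
    tokens = tokenize_smiles("CC(=O)O")
    print(tokens)
    print(Counter(tokens))
```

In this example both carbons map to the token `C` and both oxygens to `O`, even though each pair sits in a different chemical environment. This is the kind of ambiguity that a tokenization scheme with environment-aware tokens is meant to remove.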

Similar Papers

Molecular sharing and molecular-specific representations for multimodal molecular property prediction
Xuecong Tian ... Chen Chen
Applied Soft Computing | VOL. 163
18 Jun 2024

3DSGIMD: An accurate and interpretable molecular property prediction method using 3D spatial graph focusing network and structure-based feature fusion
Yanan Tian ... Huanxiang Liu
Future Generation Computer Systems | VOL. 161
08 Jul 2024

Studies on molecular properties prediction, antitubercular and antimicrobial activities of novel quinoline based pyrimidine motifs
N.C Desai ... A.R Trivedi
Bioorganic & Medicinal Chemistry Letters | VOL. 24
15 May 2014

Molecular geometric deep learning
Cong Shen ... Kelin Xia
Cell reports methods | VOL. 3
23 Oct 2023