Abstract

Deep learning has driven dramatic progress in molecular property prediction, a crucial task in drug discovery, using various representations such as fingerprints, SMILES, and graphs. In particular, SMILES is used in many deep learning models via character-based approaches. However, SMILES has a limitation: it is hard for a model to infer chemical properties from the string alone. In this paper, we propose a new self-supervised method that learns SMILES and the chemical context of molecules simultaneously while pre-training a Transformer. The key ideas of our model are learning molecular structure through adjacency matrix embedding and learning the logic needed to infer descriptors through Quantitative Estimation of Drug-likeness (QED) prediction during pre-training. As a result, our method improves generalization and achieves the best average performance on benchmark downstream tasks. Moreover, we provide a web-based fine-tuning service so that our model can be applied to various tasks.
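To make the QED pre-training target concrete, here is a minimal sketch, assuming RDKit is available, of how QED labels can be generated from unlabeled SMILES strings; the function name and the skip-on-parse-failure policy are illustrative assumptions, not the authors' pipeline.

```python
# A minimal sketch (not the authors' code): generating QED labels for
# self-supervised pre-training from raw SMILES strings with RDKit.
from rdkit import Chem
from rdkit.Chem import QED

def qed_labels(smiles_list):
    """Return a drug-likeness score in [0, 1] for each parseable SMILES."""
    labels = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        # Skip strings RDKit cannot parse rather than training on noise.
        labels.append(QED.qed(mol) if mol is not None else None)
    return labels

print(qed_labels(["CCO", "c1ccccc1O"]))  # ethanol, phenol
```

Because QED is a deterministic function of the molecule, such labels cost nothing to produce at scale, which is what makes the task usable as a self-supervised pre-training signal.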

Highlights

  • Deep learning has driven dramatic progress in molecular property prediction, a crucial task in drug discovery, using various representations such as fingerprints, Simplified Molecular-Input Line-Entry System (SMILES), and graphs

  • Machine learning models use these representations as features, together with additional descriptors such as the partition coefficient (LogP), hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), and polar surface area (PSA); a descriptor-computation sketch follows this list

  • Simplified Molecular-Input Line-Entry System (SMILES) is a text representation of a molecule in a single line. As it is sequential and composed of text, methods inspired by natural language processing (NLP), such as word embedding[7] and RNNs[8,9], have been proposed; a simple SMILES tokenizer is also sketched after this list. Mol2Vec[10] is a molecular representation inspired by word2vec
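As referenced in the second highlight, these descriptors can be computed directly from a parsed molecule. A minimal sketch, assuming RDKit (the descriptor functions shown are real RDKit APIs, while the helper name is hypothetical):

```python
# A minimal sketch, assuming RDKit: computing the descriptors named
# above (partition coefficient, HBA, HBD, PSA) for one molecule.
from rdkit import Chem
from rdkit.Chem import Descriptors

def basic_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"unparseable SMILES: {smiles}")
    return {
        "LogP": Descriptors.MolLogP(mol),        # Wildman-Crippen partition coefficient
        "HBA":  Descriptors.NumHAcceptors(mol),  # hydrogen bond acceptors
        "HBD":  Descriptors.NumHDonors(mol),     # hydrogen bond donors
        "PSA":  Descriptors.TPSA(mol),           # topological polar surface area
    }

print(basic_descriptors("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```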
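And since SMILES is handled as text, NLP-style models need a token split first. The regex below follows a pattern widely used in the SMILES literature; it is an illustrative choice, not necessarily the tokenizer used in this paper.

```python
# A minimal sketch: splitting a SMILES string into tokens so it can be
# fed to an NLP-style model (embedding layer, RNN, or Transformer).
import re

SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|@@?|=|#|\(|\)|\.|/|\\|\+|-|%\d{2}|\d)"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, f"untokenized characters in {smiles}"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
```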

Introduction

Deep learning has driven dramatic progress in molecular property prediction, a crucial task in drug discovery, using various representations such as fingerprints, SMILES, and graphs. Simplified Molecular-Input Line-Entry System (SMILES) is a text representation of a molecule in a single line. As it is sequential and composed of text, methods inspired by NLP, such as word embedding[7] and RNNs[8,9], have been proposed. Self-supervised learning[13,14,15], which does not need labeled data for training, improves performance on most natural language processing (NLP) tasks by pre-training a model such as the Transformer[16]. These methods build a large language model that, once pre-trained, can be adapted to a variety of tasks.
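The adjacency matrix embedding mentioned in the abstract presupposes recovering the molecular graph behind each SMILES string. A minimal sketch, assuming RDKit and NumPy; the fixed-size zero-padding scheme and the names are assumptions made for batching, not the authors' implementation:

```python
# A minimal sketch: deriving a molecule's adjacency matrix from SMILES
# and zero-padding it to a fixed atom count so it can be batched as a
# structural input alongside the SMILES tokens.
import numpy as np
from rdkit import Chem

def padded_adjacency(smiles, max_atoms=64):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() > max_atoms:
        return None  # unparseable, or too large for the chosen padding
    adj = Chem.GetAdjacencyMatrix(mol).astype(np.float32)
    n = adj.shape[0]
    padded = np.zeros((max_atoms, max_atoms), dtype=np.float32)
    padded[:n, :n] = adj
    return padded

A = padded_adjacency("CCO")  # ethanol: bonds C-C and C-O
print(A[:3, :3])  # the upper-left block holds the real connectivity
```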

