Abstract

Molecular property prediction is an essential task in drug discovery. Most computational approaches based on deep learning focus either on designing novel molecular representations or on combining them with advanced models. However, researchers have paid less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC), and the task remains challenging because labeled data are limited in scale. Motivated by recent advances in pretrained models for natural language processing, and observing that drug molecules can, to some extent, be viewed as a language, we investigate how to adapt the pretrained BERT model to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained on four million unlabeled drug SMILES strings (from ZINC 15 and ChEMBL 27) to generate embeddings of molecular substructures. The pretrained model is then fine-tuned on various molecular property prediction tasks. To examine the performance of the proposed Mol-BERT, we conduct experiments on four widely used molecular datasets. Compared with traditional and state-of-the-art baselines, the results show that Mol-BERT outperforms current sequence-based methods and achieves at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
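The following minimal sketch (not the authors' code) illustrates the pretrain-then-fine-tune pattern described above: a BERT-style encoder over tokenized SMILES with a classification head for a binary property task. The tokenizer, vocabulary size, and all hyperparameters are assumptions for illustration; Mol-BERT's substructure-level embedding scheme is only hinted at here.

```python
# Hedged sketch of fine-tuning a BERT-style encoder on SMILES for property prediction.
# Hyperparameters and the tokenization step are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class MolPropertyClassifier(nn.Module):
    def __init__(self, vocab_size=3000, hidden_size=256, num_labels=1):
        super().__init__()
        config = BertConfig(vocab_size=vocab_size, hidden_size=hidden_size,
                            num_hidden_layers=6, num_attention_heads=8,
                            intermediate_size=4 * hidden_size)
        self.encoder = BertModel(config)   # in practice, weights come from large-scale pretraining
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # representation of the first ([CLS]-style) token
        return self.head(cls)              # logits for the molecular property label

# smiles_to_ids(...) would be a hypothetical tokenizer mapping SMILES substrings to vocabulary ids.
# Fine-tuning then uses a standard binary cross-entropy objective, e.g.:
#   loss = nn.BCEWithLogitsLoss()(model(ids, mask).squeeze(-1), labels.float())
```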

Highlights

  • Identifying molecular properties plays an essential part in drug discovery and materials science, and can alleviate the costly and time-consuming process associated with traditional experimental methods [1].

  • This is because Mol-BERT leverages a molecular representation pretrained on large-scale unlabeled SMILES sequences, whereas the extended-connectivity fingerprint (ECFP) relies heavily on feature engineering.

  • The reason could be that our method adopts a molecular representation that captures the structural features of molecular substructures, which benefits performance.


Summary

Introduction

Identifying molecular properties (e.g., bioactivity and toxicity) plays an essential part in drug discovery and materials science, as it can alleviate the costly and time-consuming process of traditional experimental methods [1]. This process is usually known as molecular property prediction, and it is a fundamental task for exploring the functionality of new drugs. Extended-connectivity fingerprint (ECFP) [11], the most representative fingerprint method, was designed to generate different types of circular fingerprints that encode the molecular structures of atom neighborhoods using a fixed hash function [12]. The obtained fingerprint representations are then fed to traditional machine learning models to perform predictions, and they can be used with a wide range of models, such as logistic regression, support vector classification, kernel ridge regression, random forest, influence relevance voting, and multitask networks [13]. This line of research heavily depends on the design of hand-crafted features and domain knowledge.
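As a brief illustration (not taken from the paper) of the ECFP baseline described above, RDKit's Morgan fingerprint is the standard ECFP implementation, and the resulting bit vectors can be fed to a conventional classifier such as a random forest. The SMILES strings and labels below are toy placeholders.

```python
# Sketch of an ECFP + random forest baseline; data and hyperparameters are illustrative only.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def ecfp(smiles, radius=2, n_bits=2048):
    """Hash circular atom neighborhoods (up to `radius` bonds) into a fixed-length bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]   # ethanol, benzene, aspirin
labels = [0, 0, 1]                                       # toy property labels

X = np.stack([ecfp(s) for s in smiles])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict_proba(X)[:, 1])                        # predicted probability of the property
```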


