Abstract
Molecular property prediction is an essential task in drug discovery. Most deep learning approaches focus either on designing novel molecular representations or on combining them with advanced models, while researchers have paid less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). The task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models in natural language processing, and observing that drug molecules can, to some extent, be naturally viewed as a language, we investigate how to adapt the pretrained BERT model to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained on four million unlabeled drug SMILES strings (from ZINC 15 and ChEMBL 27) to generate embeddings of molecular substructures. The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of the proposed Mol-BERT, we conduct experiments on four widely used molecular datasets. Compared with traditional and state-of-the-art baselines, the results show that Mol-BERT outperforms current sequence-based methods and achieves at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
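The pretraining step described above treats SMILES strings as sentences whose tokens are masked and predicted, in the style of BERT's masked language modeling. The following is a minimal sketch of that idea, assuming a common regex-based SMILES tokenizer; the tokenizer pattern and `mask_tokens` helper are illustrative assumptions, not the paper's actual implementation.

```python
import re
import random

# Common regex tokenizer for SMILES: bracket atoms, two-letter and
# one-letter elements, aromatic atoms, bonds, branches, and ring digits.
SMILES_REGEX = re.compile(
    r"\[[^\]]+\]|Br|Cl|B|C|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)

def tokenize(smiles):
    """Split a SMILES string into substructure-level tokens."""
    return SMILES_REGEX.findall(smiles)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style masking: hide a fraction of tokens and keep the
    originals as targets for masked-language-model pretraining."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # not a prediction target
    return masked, labels

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens, seed=42)
```

During fine-tuning, the same tokenization is kept but the masking objective is replaced with a property-prediction head on top of the pretrained encoder.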
Highlights
Identifying molecular properties plays an essential part in drug discovery and material science, as it can alleviate the costly and time-consuming process of traditional experimental methods [1].
This is because Mol-BERT leverages a molecular representation pretrained on large-scale unlabeled SMILES sequences, whereas the extended-connectivity fingerprint (ECFP) relies heavily on feature engineering.
The reason could be that our method adopts a molecular representation that considers the structural features of molecular substructures, which benefits the performance.
Summary
Identifying molecular properties (e.g., bioactivity and toxicity) plays an essential part in drug discovery and material science, as it can alleviate the costly and time-consuming process of traditional experimental methods [1]. This process is usually known as molecular property prediction, and it is a fundamental task in exploring the functionality of new drugs. Extended-connectivity fingerprint (ECFP) [11], the most representative fingerprint method, was designed to generate circular fingerprints that encode the molecular structure of atom neighborhoods using a fixed hash function [12]. The resulting fingerprint representations are then fed to traditional machine learning models to perform predictions, and they can be applied to a wide range of models, such as logistic regression, support vector classification, kernel ridge regression, random forest, influence relevance voting, and multitask networks [13]. This line of research heavily depends on the design of hand-crafted features and domain knowledge.
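The circular-fingerprint idea behind ECFP can be sketched as follows: start from per-atom identifiers, repeatedly hash each atom together with its neighbors up to a chosen radius, and fold the resulting identifiers into a fixed-length bit vector. This is a toy illustration of the principle only, not RDKit's actual ECFP algorithm; the molecule encoding (explicit atom and bond lists) and the CRC32 hash are assumptions made for self-containment.

```python
import zlib

def _stable_hash(x):
    # Deterministic hash; Python's built-in hash() is randomized per process.
    return zlib.crc32(repr(x).encode("utf-8"))

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy ECFP-style circular fingerprint: iteratively hash each atom's
    neighborhood and fold the identifiers into a fixed-length bit vector.
    Illustrative only -- not RDKit's ECFP implementation."""
    # Build an adjacency list from the bond pairs.
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    # Radius 0: each atom's identifier is derived from its element symbol.
    ids = {i: _stable_hash(sym) for i, sym in enumerate(atoms)}
    bits = {v % n_bits for v in ids.values()}
    for _ in range(radius):
        # Grow each neighborhood by one bond and re-hash it.
        ids = {
            i: _stable_hash((ids[i],) + tuple(sorted(ids[j] for j in adj[i])))
            for i in ids
        }
        bits |= {v % n_bits for v in ids.values()}
    return [1 if b in bits else 0 for b in range(n_bits)]

# Ethanol (CCO) as an explicit atom/bond list.
fp = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

The resulting bit vector is the kind of feature that would then be passed to a downstream classifier such as logistic regression or a random forest.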