Abstract

In drug discovery, computational biology, and cheminformatics, machine learning models must predict accurately on test samples whose distribution differs from that of the training data. However, (i) labeled task-specific molecule data are often scarce, and (ii) models generalize poorly when test molecules are structurally different from those seen during training. To alleviate these problems, we propose MolCloze, a cloze-style self-supervised learning model that learns universal, informative representations for molecular property prediction tasks. With carefully designed self-supervised tasks that unify the generative and discriminative paradigms, MolCloze can learn rich structural and semantic information about molecules from large amounts of unlabeled molecular data. To capture this complex information, we design two novel strategies: Structural Fingerprint Tokenization (SFT), which tokenizes molecule graphs more effectively, and Normalized Graph Raw Shortcut-connection (NGRS), which yields better latent representations by enabling the training of a deeper model. We pre-train MolCloze on three tasks: Unordered Masked Language Modeling (UMLM), Replaced Masked Token Detection (RMTD), and Contrastive Energy-based Unmasked Token Clozing (CE-UTC). We then transfer the pre-trained model to a broad range of downstream molecular property prediction tasks with minor architectural modifications. Extensive experiments demonstrate the generalizability of MolCloze across a broad range of chemical properties related to drug discovery. We also observe a significant performance boost on several downstream molecular property prediction datasets, where MolCloze outperforms state-of-the-art baseline approaches and previous pre-training techniques developed for molecular data.
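
To make the cloze-style idea concrete, the following is a minimal sketch of generic masked-token corruption followed by replaced-token labeling. It is not the MolCloze implementation: the abstract does not specify how UMLM or RMTD are built, so the function names `mask_tokens` and `replace_masked`, the `[MASK]` symbol, the masking ratio, the example tokens, and the uniform random sampler (standing in for whatever model would propose replacements) are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_ratio=0.15, rng=None):
    """Hide a fraction of tokens; return (corrupted, targets).

    targets maps each masked position to its original token, which is
    what a masked-language-modeling objective would try to recover.
    """
    if not tokens:
        return [], {}
    rng = rng or random.Random()
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        corrupted[pos] = MASK
    return corrupted, targets


def replace_masked(corrupted, targets, vocab, rng=None):
    """Fill masked slots with sampled tokens; return (filled, labels).

    labels[i] is 1 where the filled token differs from the original and
    0 elsewhere, the target of a replaced-token detection objective.
    Uniform sampling is an assumption standing in for a learned proposer.
    """
    rng = rng or random.Random()
    filled = list(corrupted)
    labels = [0] * len(corrupted)
    for pos, original in targets.items():
        candidate = rng.choice(vocab)
        filled[pos] = candidate
        labels[pos] = int(candidate != original)
    return filled, labels


if __name__ == "__main__":
    # Hypothetical substructure tokens for one molecule.
    tokens = ["C", "C", "O", "N", "c1ccccc1", "C(=O)O"]
    rng = random.Random(0)
    corrupted, targets = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
    filled, labels = replace_masked(corrupted, targets, sorted(set(tokens)), rng=rng)
    print(corrupted, targets)
    print(filled, labels)
```

In this sketch the first function supplies targets for a generative (recover the token) objective and the second supplies targets for a discriminative (detect the replacement) objective, which is one plausible reading of how the two paradigms could share a single corrupted input.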

