Abstract
The Simplified Molecular-Input Line-Entry System (SMILES) is one of the most widely used molecular representations for molecular property prediction. We conjecture that all the characters in the SMILES string of a molecule are essential for describing its structure, but that most of them contribute little to determining any particular property of the molecule. We verified this conjecture in a preliminary experiment. Motivated by the result, we propose injecting perturbing noise into the original labeled SMILES strings to construct augmented training data. This increases the diversity of the labeled molecules, alleviates the scarcity of labeled compound data, and helps the model extract more useful molecular representations for property prediction. Specifically, we directly apply mask, swap, deletion, and fusion operations to SMILES strings, randomly masking, swapping, and deleting atoms. The augmented data is then used under one of two strategies: alternating between the original and the noisy molecules from epoch to epoch, or alternating between them from batch to batch. We conduct experiments with both Transformer and BiGRU models on widely used datasets from MoleculeNet and ZINC to validate the effectiveness of the approach. Experimental results demonstrate that the proposed method, NoiseMol, outperforms strong baselines on all the datasets. NoiseMol obtains the best performance on BBBP and FDA when compared with state-of-the-art methods, and it also achieves the best accuracy on LogP. Injecting perturbing noise into labeled SMILES strings is therefore an effective and efficient method that improves the prediction performance, generalization, and robustness of deep learning models.
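As a rough illustration of the atom-level noise operations and the two feeding strategies described above, the following Python sketch may help. It is not the authors' implementation: the regex tokenizer, the `<mask>` symbol, the noise rates, and all function names are assumptions made for this example. The fusion operation is omitted because the abstract does not describe it. Note that the perturbed strings are generally not valid SMILES, which appears to be the intent of the noise injection.

```python
import random
import re

# Assumed, simplified tokenizer: bracket atoms, two-letter halogens,
# then any single character (bonds, branches, ring digits, atoms).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")
# Tokens treated as atoms in this sketch (organic subset plus bracket atoms).
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")
MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def atom_positions(tokens):
    """Return the indices of tokens that represent atoms."""
    return [i for i, t in enumerate(tokens) if ATOM_RE.fullmatch(t)]


def mask_atoms(tokens, rate, rng):
    """Replace each atom token with MASK with probability `rate`."""
    out = list(tokens)
    for i in atom_positions(tokens):
        if rng.random() < rate:
            out[i] = MASK
    return out


def swap_atoms(tokens, rng):
    """Exchange two randomly chosen atom tokens, if at least two exist."""
    out = list(tokens)
    positions = atom_positions(tokens)
    if len(positions) >= 2:
        i, j = rng.sample(positions, 2)
        out[i], out[j] = out[j], out[i]
    return out


def delete_atoms(tokens, rate, rng):
    """Drop each atom token with probability `rate`; keep all other tokens."""
    atoms = set(atom_positions(tokens))
    return [t for i, t in enumerate(tokens)
            if not (i in atoms and rng.random() < rate)]


def use_noisy(epoch, batch, strategy):
    """Decide whether a training step is fed noisy molecules.

    "epoch": odd-numbered epochs use noisy data, even-numbered use originals.
    "batch": odd-numbered batches use noisy data, even-numbered use originals.
    """
    if strategy == "epoch":
        return epoch % 2 == 1
    if strategy == "batch":
        return batch % 2 == 1
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    print("".join(mask_atoms(tokens, 0.15, rng)))
    print("".join(swap_atoms(tokens, rng)))
    print("".join(delete_atoms(tokens, 0.15, rng)))
```

In a training loop, `use_noisy(epoch, batch, strategy)` would select between the original token sequence and one of the perturbed variants for the current step. The label of the molecule stays unchanged in either case.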