Abstract
Molecular latent representations derived from autoencoders (AEs) have been widely used for drug and material discovery over the past few years. In particular, a variety of machine learning methods based on latent representations have shown excellent performance in quantitative structure–activity relationship (QSAR) modeling. However, the sequential nature of these representations has not been considered in most cases. In addition, data scarcity remains a main obstacle for deep learning strategies, especially for bioactivity datasets. In this study, we propose the convolutional recurrent neural network and transfer learning (CRNNTL) method, inspired by applications in polyphonic sound detection and electrocardiogram classification. Our model takes advantage of both convolutional and recurrent neural networks for feature extraction, as well as a data augmentation method. In QSAR modeling on 27 datasets, CRNNTL outperforms or is competitive with state-of-the-art methods for both drug and material properties. In addition, its performance on an isomer-based dataset indicates that this strong performance results from an improved ability to extract global features while the ability to extract local features is maintained. The transfer learning results then show that CRNNTL can overcome data scarcity when relevant source datasets are chosen. Finally, the versatility of our model is shown by using latent representations from other types of AEs as inputs.
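The combination of convolutional and recurrent feature extraction over a latent vector can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar untrained weights, the 64-dimensional random latent vector, and the kernel values are all assumptions chosen for illustration. It only shows the data flow, where a 1D convolution picks up local patterns and a recurrent pass summarizes the whole sequence.

```python
import math
import random


def conv1d(seq, kernel):
    """Valid 1D cross-correlation followed by ReLU (local feature extraction)."""
    k = len(kernel)
    return [
        max(0.0, sum(seq[i + j] * kernel[j] for j in range(k)))
        for i in range(len(seq) - k + 1)
    ]


def max_pool(seq, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def rnn_last_state(seq, w_in, w_rec, bias):
    """Scalar Elman-style recurrence; the final state summarizes the sequence."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h + bias)
    return h


def crnn_forward(latent, kernel, pool, w_in, w_rec, bias, w_out, b_out):
    """Convolution -> pooling -> recurrence -> linear readout (hypothetical layout)."""
    local = max_pool(conv1d(latent, kernel), pool)
    h = rnn_last_state(local, w_in, w_rec, bias)
    return w_out * h + b_out


# Hypothetical stand-in for an autoencoder latent vector (assumption: 64 dims).
rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(64)]

# Fixed, untrained weights; in practice these would be learned from QSAR data.
kernel = [0.5, -0.25, 0.5]
prediction = crnn_forward(
    latent, kernel, pool=2, w_in=0.8, w_rec=0.5, bias=0.0, w_out=1.0, b_out=0.0
)
print(prediction)
```

In a transfer learning setting, one would typically fit such weights on a larger source dataset first and then continue training on the smaller target dataset. That step is omitted here because the abstract does not specify how it is done.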
Highlights
Quantitative structure-activity relationship (QSAR) modeling, which aims to uncover the molecular factors that are crucial to properties and activities, has been an active research area for more than 50 years.
Although AEs were first presented as generative algorithms for de novo design studies [8,9], latent representations from their encoders have since been extracted for QSAR modeling [6].
We describe the convolutional recurrent neural network and transfer learning (CRNNTL) method, which addresses the sequential nature of molecular representations and the problem of data scarcity.
Summary
Quantitative structure-activity relationship (QSAR) modeling, which aims to uncover the molecular factors that are crucial to properties and activities, has been an active research area for more than 50 years. In QSAR, molecular representations (or descriptors) serve as the input features of the model and encode the chemical information of actual entities as computer-readable numbers [1,2,3]. Molecular fingerprints, such as extended-connectivity fingerprints (ECFPs), have been widely used as representations for modeling in drug and material discovery [4]. Although AEs were first presented as generative algorithms for de novo design studies [8,9], latent representations from their encoders have since been extracted for QSAR modeling [6].