Abstract

Recurrent neural networks have been widely used to generate millions of de novo molecules in defined chemical spaces. Reported deep generative models are exclusively based on LSTM and/or GRU units and frequently trained using canonical SMILES. In this study, we introduce Generative Examination Networks (GEN) as a new approach to train deep generative networks for SMILES generation. In our GENs, we have used an architecture based on multiple concatenated bidirectional RNN units to enhance the validity of generated SMILES. GENs autonomously learn the target space in a few epochs and are stopped early using an independent online examination mechanism that measures the quality of the generated set. Herein we have used online statistical quality control (SQC) on the percentage of valid molecular SMILES as the examination measure to select the earliest available stable model weights. Very high levels of valid SMILES (95–98%) can be generated using multiple parallel encoding layers in combination with SMILES augmentation using unrestricted SMILES randomization. Our trained models achieve an excellent novelty rate (85–90%) while generating SMILES with strong conservation of the property space (95–99%). In GENs, both the generative network and the examination mechanism are open to other architectures and quality criteria.
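The examination mechanism described above can be illustrated with a minimal sketch. The function names, window size, tolerance and target threshold here are illustrative assumptions, not the paper's actual SQC parameters: after each epoch a sample of SMILES is generated, the fraction of valid strings is recorded, and training stops once that metric is both high and stable over a window of recent epochs.

```python
def fraction_valid(smiles_batch, is_valid):
    """Fraction of a generated SMILES sample that passes a validity check.

    `is_valid` is a caller-supplied predicate (e.g. a SMILES parser).
    """
    return sum(1 for s in smiles_batch if is_valid(s)) / len(smiles_batch)

def should_stop(history, window=5, tol=0.01, target=0.95):
    """SQC-style early stop: the last `window` validity scores must all
    exceed `target` and vary by less than `tol` (illustrative criterion).
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    return min(recent) >= target and (max(recent) - min(recent)) < tol

# Usage with a mocked per-epoch validity curve standing in for the
# online generator's measurements:
history = []
for score in [0.60, 0.80, 0.93, 0.96, 0.965, 0.962, 0.968, 0.964]:
    history.append(score)
    if should_stop(history):
        break  # earliest available stable model weights would be kept here
```

Because the examination runs outside the training loop and gives no feedback to the optimizer, any other quality criterion (novelty, property conservation) could be substituted for validity in `fraction_valid` without changing the training code.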

Highlights

  • Exploration of chemical space for the discovery of new molecules is a key challenge for the chemical community, e.g. pharmaceutical and olfactive industries [1, 2]

  • The property match between the distributions of the generated set and the training set remained stable at 98% and 94% for SMILES length and heavy atom count (HAC), respectively

  • Our analysis showed that the decision to stop training early based on the percentage of valid molecules did not affect the capability of the model to generate SMILES with a high degree of novelty


Introduction

Exploration of chemical space for the discovery of new molecules is a key challenge for the chemical community, e.g. the pharmaceutical and olfactive industries [1, 2]. Reported generative approaches include [12], Recurrent Neural Networks (RNNs) [6, 13,14,15], Generative Adversarial Networks (GANs) [16] and reinforcement learning (RL) [17], or generate molecules based on molecular graph representations [18], among many other approaches as reviewed in [19]. Contrary to these earlier reports, we demonstrate that text learning on SMILES is highly efficient at exploring the training space with a high degree of novelty. During training, the learning progress of the generators is periodically examined using an independent online examination mechanism without feedback to the learning rate of the student. In a GEN, an online generator applies statistical quality control after every training epoch, measuring the percentage of valid SMILES in a statistical sample of generated SMILES. Following the excellent results of SMILES augmentation for smaller datasets in predicting physicochemical properties [22,23,24] and for generators [25], we have used SMILES augmentation to increase both the number and the diversity of SMILES in the training set.
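The augmentation step can be sketched as follows, assuming RDKit is available. The helper name and variant count are illustrative, and the paper's exact randomization procedure may differ; RDKit's `doRandom` flag in `MolToSmiles` produces unrestricted randomized (non-canonical) SMILES, each encoding the same molecule:

```python
from rdkit import Chem

def augment_smiles(smiles, n_variants=10):
    """Return up to n_variants randomized (non-canonical) SMILES strings
    that all encode the same molecule (unrestricted randomization).

    Duplicates from the random atom ordering are collapsed via the set,
    so fewer than n_variants strings may be returned for small molecules.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable input: nothing to augment
        return []
    variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True)
                for _ in range(n_variants)}
    return sorted(variants)

# Every variant canonicalizes back to the same molecule (aspirin here):
seed = "CC(=O)Oc1ccccc1C(=O)O"
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(seed))
assert all(Chem.MolToSmiles(Chem.MolFromSmiles(v)) == canonical
           for v in augment_smiles(seed))
```

Applied to each molecule in the training set, this multiplies both the number of training strings and the syntactic diversity the generator sees, without changing the underlying set of molecules.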

