Exploring the GDB-13 chemical space using deep generative models

Josep Arús-Pous, Thomas Blaschke, Silas Ulander, Hongming Chen, Ola Engkvist, Jean-Louis Reymond

doi:10.1186/s13321-019-0341-z

Open Access

Abstract

Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train RNNs on molecular string representations (SMILES) from a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained with 1 million structures (0.1% of the database) reproduces 68.9% of the entire database when 2 billion molecules are sampled from it after training. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” that compares the trained model to an upper bound, which lets us quantify how much it has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space was performed, which shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample.
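The upper bound mentioned in the abstract can be illustrated with a short calculation. The sketch below is not the authors' code: it assumes an ideal model that samples uniformly, with replacement, from exactly the molecules in GDB-13, and it uses the rounded database size of 975 million given above. The function name `expected_coverage` is hypothetical.

```python
import math


def expected_coverage(n_total: int, n_samples: int) -> float:
    """Expected fraction of n_total distinct items seen at least once
    after n_samples uniform draws with replacement.

    Each item is missed by a single draw with probability 1 - 1/n_total,
    so it is missed by all draws with probability (1 - 1/n_total)**n_samples.
    log1p/expm1 keep the computation accurate for very large n_total.
    """
    return -math.expm1(n_samples * math.log1p(-1.0 / n_total))


# Assumption: rounded GDB-13 size and sample size taken from the abstract.
GDB13_SIZE = 975_000_000
SAMPLE_SIZE = 2_000_000_000

ideal = expected_coverage(GDB13_SIZE, SAMPLE_SIZE)
print(f"ideal uniform sampler covers about {ideal:.1%} of the database")
print("trained model (reported in the abstract): 68.9%")
```

Under these assumptions, the gap between the ideal coverage and the reported 68.9% is one way to read how far the trained model is from a perfect uniform sampler of the database.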

Highlights

Finding novel molecules with specific properties is one of the main problems that drug discovery faces
Using negative log-likelihood plots to guide the training process: a model was trained with a set of 1 million compounds randomly obtained from GDB-13
To follow the progress of the training, negative log-likelihoods (NLLs) of the SMILES in the training, validation and sampled sets were calculated after each training epoch
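As a minimal sketch of the quantity tracked in these plots, the NLL of one SMILES string can be computed from the probabilities a model assigns to each of its tokens given the preceding ones. The helper names and the probability values below are hypothetical and for illustration only; a real model would supply the per-token probabilities.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one sequence: the sum of -ln(p) over
    the conditional probability p of each token."""
    return -sum(math.log(p) for p in token_probs)


def mean_nll(per_sequence_probs: Sequence[Sequence[float]]) -> float:
    """Average NLL over a set of sequences (e.g. a training, validation
    or sampled set)."""
    nlls = [sequence_nll(probs) for probs in per_sequence_probs]
    return sum(nlls) / len(nlls)


# Made-up per-token probabilities for two short sequences.
example_set = [
    [0.5, 0.25, 0.5],
    [0.5, 0.5],
]
print(sequence_nll(example_set[0]))
print(mean_nll(example_set))
```

A lower NLL means the model considers the string more likely, so comparing the NLL values of the three sets after each epoch shows whether the model assigns similar likelihoods to molecules it was trained on, held-out molecules and its own samples.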

Summary

Introduction

Finding novel molecules with specific properties is one of the main problems that drug discovery faces. One of the most common approaches to this is to explore chemical space by enumerating large virtual libraries, hoping to find a novel region of space containing useful structures. The drug-like chemical space is intractably large: a rough estimate would be at least 10^23 molecules [1]. One way of exploring it is to use implicit models, which do not store all molecules in a region of the chemical space but instead represent molecules indirectly. Techniques such as chemical space navigation by mutations [2] or creating reaction graphs have proven to be successful [3, 4]. By searching public databases that contain molecules obtained from various sources, e.g. ChEMBL [5], new

Objectives

Methods

Results

Conclusion