Abstract

Chemical diversity is a key concept in machine learning and molecular generation. This is particularly true for quantum chemical datasets, which must be composed meticulously because the calculations are highly time-consuming. We previously showed that the best-known quantum chemical dataset, QM9, lacks chemical diversity; as a consequence, ML models trained on QM9 showed shortcomings in generalizability. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our approach makes it possible to estimate the individual contribution of a solution to the diversity of a set, allowing for efficient incremental evaluation. In the first application, we show how the diversity constraint allows us to generate more than a million molecules that would efficiently complement the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, reaching a high QED score is not difficult, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.
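The idea of estimating a single molecule's contribution to the diversity of a set can be illustrated with a minimal sketch. It is not the paper's method. It assumes, purely for illustration, that each molecule is reduced to a set of descriptor features (for example functional-group labels or shingles) and that the diversity of a set is the number of distinct features it contains. The class `DiversityTracker` and the feature labels are hypothetical.

```python
from typing import Hashable, Iterable, Set


class DiversityTracker:
    """Tracks the distinct descriptor features seen across a set of molecules.

    Assumption: diversity is measured as the count of distinct features.
    This is an illustrative stand-in, not the measure defined in the paper.
    """

    def __init__(self) -> None:
        self.seen: Set[Hashable] = set()

    def gain(self, features: Iterable[Hashable]) -> int:
        """Return how many new features a molecule would add (no mutation)."""
        return len(set(features) - self.seen)

    def add(self, features: Iterable[Hashable]) -> int:
        """Add a molecule's features to the set and return how many were new."""
        new = set(features) - self.seen
        self.seen |= new
        return len(new)

    @property
    def diversity(self) -> int:
        """Number of distinct features currently covered by the set."""
        return len(self.seen)


# Hypothetical feature labels standing in for real molecular descriptors.
tracker = DiversityTracker()
tracker.add({"alcohol", "ether"})
tracker.add({"alcohol", "amine"})

# Contribution of a candidate molecule, evaluated without rescanning the set.
candidate_gain = tracker.gain({"amine", "nitrile", "thiol"})
```

Because only the candidate's features are compared against the features already seen, the cost of scoring a candidate does not depend on how many molecules are already in the set. That is the kind of incremental evaluation the abstract refers to, under the assumptions stated above.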

Highlights

  • Many applications in the field of molecular chemistry rely on specific electronic properties

  • In a previous study we have shown that the most widely used quantum chemistry dataset for small organic molecules, QM9 [11], lacked chemical diversity [12]

  • On the other hand, when optimizing the QED property, we studied CheckMol, identified functional groups (IFGs) and shingles separately to observe the impact of the choice of descriptors

Summary

Introduction

Many applications in the field of molecular chemistry rely on specific electronic properties. To evaluate these properties precisely, quantum chemistry calculations are necessary. These calculations are costly in terms of time and computing resources, which can slow down the discovery of new compounds. One of the great hopes of using machine learning (ML) methods in chemistry is to be able to reduce the amount of [...] Supervised ML methods depend heavily on the size and quality of the dataset to generalize well. In a previous study we showed that the most widely used quantum chemistry dataset for small organic molecules, QM9 [11], lacks chemical diversity [12]. A model trained on QM9 could be quite accurate for [...]
