Abstract

The design of chemical formulations is a challenging, high-dimensional problem. In typical formulations, tens of thousands of ingredients are available for use, yet only a tiny fraction end up in a given formulation. Deformulation, the problem of reverse engineering the precise amounts of each ingredient starting from just a list of ingredients, is similarly challenging but is a key capability for staying up-to-date with industry competitors. Here, we take advantage of a large, curated formulations dataset from CAS, a division of the American Chemical Society, which offers a consistent and highly structured representation of the formulations and the chemical identities of their components, to show that a variational autoencoder neural network learns meaningful representations of formulations in various product classes such as antiperspirants and oral care. Furthermore, it can be used in conjunction with a two-step sampling algorithm to generate accurate ingredient amount suggestions for deformulation. Deformulation using a variational autoencoder produces estimates that are significantly more accurate than those of nearest neighbor methods, extrapolates better to formulations that are significantly different from previously seen formulations, and provides a way to leverage large datasets for industrially relevant capabilities.

Highlights

  • Machine learning, over the past decade, has become a de facto tool for aiding the rapid discovery and optimization of chemicals and materials.[1−8] While machine learning, via the application of supervised predictive models, has made substantial and broadly applicable progress in structure−property predictions,[9−11] stability screening,[12,13] and chemical text mining,[14−18] the utility of generative models remains less explored.

  • A hallmark of variational autoencoders (VAEs) is the clustering of similar data in the latent space, with smooth transitions between points corresponding to sensible transitions in real space.[40]

  • While some reverse engineering problems require the identification of chemical components and the elucidation of their quantities within the product, ingredient labeling requirements in consumer packaged goods remove most of the challenge of chemical identification in this space.


Summary

Introduction

Industrial & Engineering Chemistry Research (pubs.acs.org/IECR). Received: March 12, 2021. Revised: September 13, 2021. Accepted: September 14, 2021. Published: September 23, 2021.

Over the past decade, machine learning has become a de facto tool for aiding the rapid discovery and optimization of chemicals and materials.[1−8] While machine learning, via the application of supervised predictive models, has made substantial and broadly applicable progress in structure−property predictions,[9−11] stability screening,[12,13] and chemical text mining,[14−18] the utility of generative models remains less explored. We apply recent advances in generative models toward the problem of chemical product deformulation by leveraging a dataset of chemical formulations provided by CAS, a division of the American Chemical Society specializing in scientific information solutions (Figure 1).[35] We show that, across a variety of formulated product application areas, it is possible to train unsupervised generative models, specifically variational autoencoders (VAEs),[36] to enable rapid data-driven suggestions of formulation recipes. We further demonstrate that the latent spaces encoded by these VAEs group along intuitive dimensions, such as the amount of solvent or of the active ingredient found in the formulation, and can be leveraged to reverse-engineer unknown recipes when the ingredient order is known.
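To make the VAE terminology concrete, here is a minimal, generic sketch in plain Python. It is not the authors' model or their two-step sampling algorithm. The function names are hypothetical, the encoder and decoder networks are omitted, and the latent vectors used in the usage note are made-up values. It shows three standard ingredients of a VAE with a diagonal Gaussian latent distribution: sampling a latent point, the KL regularizer that pulls the latent space toward a standard normal, and straight-line interpolation between two latent points.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise.

    mu and log_var would normally come from an encoder network
    (omitted here; an assumption of this sketch).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE regularizer."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def interpolate(z_a, z_b, steps):
    """Evenly spaced points on the line from z_a to z_b, inclusive.

    Requires steps >= 2. Decoding each point (decoder omitted) is how
    smooth transitions in a latent space are typically inspected.
    """
    return [[a + (b - a) * t / (steps - 1) for a, b in zip(z_a, z_b)]
            for t in range(steps)]
```

For example, `interpolate([0.0, 2.0], [1.0, 0.0], 3)` yields the two endpoints plus their midpoint, and `kl_to_standard_normal` is zero exactly when the encoded distribution already equals the standard normal prior.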

