Abstract

Regression modeling is becoming increasingly prevalent in organic chemistry as a tool for reaction outcome prediction and mechanistic interrogation. Frequently, to acquire the requisite amount of data for such studies, researchers employ combinatorial datasets to maximize the number of data points while limiting the number of discrete chemical entities required. An often-overlooked problem in modeling studies using combinatorial datasets is the tendency to fit on patterns in the datasets (i.e., the presence or absence of a reactant or catalyst) rather than to identify meaningful trends between descriptors and the response variable. Consequently, the generality and interpretability of such models suffer. This report illustrates these well-known pitfalls in a case study, demonstrates the necessary control experiments to identify when this property will be problematic, and suggests how to perform further validation to assess general applicability and interpretability of models trained using combinatorial datasets.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.