Abstract

Modern statistics and machine learning typically involve large amounts of data coupled with computationally intensive methods. In a predictive modeling context, one seeks models that achieve high predictive accuracy on new datasets. This is typically implemented by partitioning the data into training and hold‐out datasets. The allocation is often conducted randomly, at the row level of the data matrix. In this work, we discuss an overlooked gap in machine learning and predictive modeling: the role of the data structure and the data generation process in the partitioning of observational data into training and hold‐out datasets. Ignoring such structures can lead to deficiencies in model generalizability and operationalization. We highlight that explicitly embracing the data generation structure when partitioning the data for validating predictive models is essential to the success of data science projects. The proposed approach is called befitting cross validation (BCV). It relies on an information quality perspective of analytics. This requires an assessment with inputs from domain experts, in contrast to automated approaches that are purely data driven. BCV is motivated by the objective of generating information quality with data and modeling. Two case studies illustrate the proposed approach. One is based on a 96‐h burn‐in process applied to electro‐mechanical devices, implemented in order to reduce early failures at the customer site. The goal was to shorten the burn‐in process with a predictive model applied at 20 h. The other case study combines tablet dissolution profiles and designed mixture experiments. The goal there was to match the dissolution profiles of the tablet under test with the reference profile of a brand tablet. These case studies demonstrate the methodological points made with BCV, which are generic in nature. We suggest that BCV principles should always be considered in the development of data‐driven predictive models.
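To make the contrast between row‐level random allocation and structure‐aware partitioning concrete, the following is a minimal sketch and not the BCV procedure itself. It assumes, purely for illustration, that the relevant structure is a grouping of rows by device, so that all measurements from one device are kept on the same side of the split. The data, the field names (`device`, `hour`, `reading`), and the helper functions are hypothetical; in practice the choice of structure would come from domain experts.

```python
import random


def row_level_split(rows, holdout_frac=0.25, seed=0):
    """Randomly allocate individual rows, ignoring any grouping."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def group_level_split(rows, group_key, holdout_frac=0.25, seed=0):
    """Allocate whole groups (here: devices) to training or hold-out."""
    rng = random.Random(seed)
    groups = sorted({row[group_key] for row in rows})
    rng.shuffle(groups)
    n_holdout = max(1, int(round(len(groups) * holdout_frac)))
    holdout_groups = set(groups[:n_holdout])
    train = [r for r in rows if r[group_key] not in holdout_groups]
    holdout = [r for r in rows if r[group_key] in holdout_groups]
    return train, holdout


def shared_groups(train, holdout, group_key):
    """Groups that appear on both sides of a split."""
    return {r[group_key] for r in train} & {r[group_key] for r in holdout}


# Hypothetical data: 8 devices, each measured at 5 time points.
rows = [
    {"device": d, "hour": h, "reading": 0.1 * d + 0.01 * h}
    for d in range(8)
    for h in (4, 8, 12, 16, 20)
]

train_r, holdout_r = row_level_split(rows)
train_g, holdout_g = group_level_split(rows, "device")

print("devices on both sides, row-level split:",
      len(shared_groups(train_r, holdout_r, "device")))
print("devices on both sides, group-level split:",
      len(shared_groups(train_g, holdout_g, "device")))
```

With the group‐level split, no device contributes rows to both the training and the hold‐out set, so hold‐out performance reflects prediction on devices the model has not seen. A row‐level split generally mixes rows from the same device across both sets.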
