As industries transition into the Industry 4.0 paradigm, the relevance of and interest in concepts like the Digital Twin (DT) are at an all-time high. DTs offer direct avenues for industries to make more accurate predictions, rational decisions, and informed plans, ultimately reducing costs and increasing performance and productivity. Adequate operation of DTs in the context of smart manufacturing relies on an evolving dataset relating to the real-life object or process, and on a means of dynamically updating the computational model to better conform to the data. This reliance on data becomes even more pronounced when physics-based computational models are unavailable or difficult to obtain in practice, as is the case in most modern manufacturing scenarios. For data-based model surrogates to "adequately" represent the underlying physics, the number of training data points must keep pace with the number of degrees of freedom in the model, which can be on the order of thousands. However, in niche industrial scenarios such as manufacturing applications, the availability of data is limited (on the order of a few hundred data points, at best), mainly because a manual measuring process typically must take place for some of the relevant quantities, e.g., the level of wear of a tool. In other words, notwithstanding the popular notion of big data, there is still a stark shortage of ground-truth data when examining, for instance, a complex system's path to failure. In this work we present a framework that alleviates this problem via modern machine learning tools, demonstrating a robust, efficient, and reliable pathway to augment the available data used to train data-based computational models.
Small sample size is a key limitation on machine learning performance, particularly with very high-dimensional data. Current efforts in synthetic data generation typically involve either Generative Adversarial Networks (GANs) or Variational AutoEncoders (VAEs). These, however, are tightly tied to image processing and synthesis, and are generally not suited to generating sensor data, which is the type of data that manufacturing applications produce. Additionally, GAN models are susceptible to mode collapse, training instability, and high computational costs when used for high-dimensional data generation. Alternatively, the encoding of VAEs greatly reduces the dimensional complexity of the data and can effectively regularize the latent space, but often produces synthetic samples of poor representational quality. Our proposed method therefore incorporates the learned latent space of an AutoEncoder (AE) architecture into the training of the generator network of a GAN. The advantages of such a scheme are twofold: \textbf{(\textit{i})} the latent-space representation created by the AE reduces the complexity of the distribution the generator must learn, allowing for quicker discriminator convergence, and \textbf{(\textit{ii})} the structure of the sensor data is better captured in the transition from the original space to the latent space. Through time statistics (up to the fifth moment), ARIMA coefficients, and Fourier series coefficients, we compare the synthetic data from our proposed AE+GAN model with the original sensor data. We also show that the performance of our proposed method is at least comparable with that of the Riemannian Hamiltonian VAE, a recently published data augmentation framework specifically designed to handle very small, high-dimensional data sets.
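The evaluation idea above (comparing synthetic and real sensor signals via their time statistics and Fourier content) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the choice of standardized central moments and of leading rFFT magnitudes, as well as the names `moment_signature` and `fourier_signature`, are assumptions for the sketch; the abstract only states that moments up to the fifth, ARIMA coefficients, and Fourier coefficients are compared.

```python
import numpy as np

def moment_signature(x, n_moments=5):
    """Time statistics of a 1-D signal: the mean, followed by
    standardized central moments of order 2..n_moments.
    (Standardization is an assumption made for this sketch.)"""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    sig = [mu]
    for k in range(2, n_moments + 1):
        sig.append(np.mean((x - mu) ** k) / sigma ** k)
    return np.array(sig)

def fourier_signature(x, n_coeffs=8):
    """Magnitudes of the leading real-FFT coefficients of the signal."""
    return np.abs(np.fft.rfft(np.asarray(x, dtype=float)))[:n_coeffs]

# Toy comparison: a "real" sensor trace vs. a "synthetic" one drawn
# from the same process (both are placeholders, not real data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
real = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal(t.size)
synthetic = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal(t.size)

# Element-wise gaps between the two signatures; small gaps indicate
# that the synthetic signal reproduces these statistics of the original.
moment_gap = np.abs(moment_signature(real) - moment_signature(synthetic))
fourier_gap = np.abs(fourier_signature(real) - fourier_signature(synthetic))
```

A full evaluation in the spirit of the abstract would additionally fit an ARIMA model to each signal (e.g., with `statsmodels`) and compare the fitted coefficients.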