Raman spectra are an example of high-dimensional data that are often limited in the number of available samples. This is a primary concern when Deep Learning frameworks are developed for tasks such as chemical species identification, quantification, and diagnostics. Open-source data are difficult to obtain and often sparse; furthermore, collecting and curating new spectra require expertise and resources. Deep generative modeling uses Deep Learning architectures to approximate high-dimensional distributions and aims to generate realistic synthetic data. Evaluation of the synthetic data and of the deep models is usually conducted on a per-task basis and provides no indication of an increase in robustness, or generalization, on a wider scale. In this study, we compare the benefits and limitations of a standard statistical approach to data synthesis (weighted blending) with a popular deep generative model, the Variational Autoencoder. Two binary data sets are divided into three folds to simulate small, limited samples. Synthetic data distributions are created per fold using the two methods and then augmented into the training of two Deep Learning algorithms, a Convolutional Neural Network and a Fully-Connected Neural Network. The goal of this study is to observe trends in learning as synthetic data are continually augmented to the training data in increasing batches. To determine the impact of each synthetic method, Principal Component Analysis and the discrete Fréchet distance are used to visualize and measure the distance between the source and synthetic distributions, and the Machine Learning metric balanced accuracy is used to evaluate performance on imbalanced data.
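The weighted-blending baseline mentioned above can be sketched as a random convex combination of pairs of same-class spectra. This is a minimal illustration: the function name, the pairwise sampling strategy, and the uniform blend weight are assumptions for demonstration, not details taken from the paper.

```python
import numpy as np

def blend_spectra(spectra, n_synthetic, seed=None):
    """Generate synthetic spectra as random convex combinations
    (weighted blends) of pairs of source spectra from one class.

    `spectra` is an (n_samples, n_channels) array; each synthetic
    spectrum is w * x_i + (1 - w) * x_j with w drawn uniformly.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_channels = spectra.shape
    synthetic = np.empty((n_synthetic, n_channels))
    for k in range(n_synthetic):
        # pick two distinct source spectra and a random blend weight
        i, j = rng.choice(n_samples, size=2, replace=False)
        w = rng.uniform()
        synthetic[k] = w * spectra[i] + (1.0 - w) * spectra[j]
    return synthetic
```

Because each output is a convex combination, every synthetic channel value stays within the per-channel range of the source class, which keeps the augmented distribution close to the original one.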
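The discrete Fréchet distance used to compare source and synthetic distributions can be computed with the standard Eiter–Mannila dynamic program over two polygonal curves. Representing each distribution as a sequence of 2-D points (e.g., paired principal-component scores) is an assumption for illustration; the paper's exact pairing protocol is not specified here.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between point sequences P and Q,
    each of shape (n_points, n_dims), via dynamic programming."""
    n, m = len(P), len(Q)
    # pairwise Euclidean distances between all points of P and Q
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):                      # first column
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):                      # first row
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j],
                               ca[i - 1, j - 1],
                               ca[i, j - 1]),
                           d[i, j])
    return ca[-1, -1]
```

Identical curves give a distance of zero, and the distance grows as the synthetic distribution drifts away from the source, which is what makes it a useful complement to the PCA visualization.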