Multivariate statistical tools and machine learning (ML) techniques can deconvolute hyperspectral data and mitigate the disparity between the number of samples and the number of features in materials science. Nevertheless, the importance of generating sufficient high-quality sample replicates for training data cannot be overlooked, as it fundamentally affects the performance of ML models. Here, we present a quantitative analysis of time-of-flight secondary ion mass spectrometry (ToF-SIMS) spectra of a simple microarray system of two food dyes using partial least-squares (PLS, linear) and random forest (RF, nonlinear) algorithms. The microarray was generated by a high-throughput sample preparation and analysis workflow designed for fast and efficient acquisition of high-quality, reproducible ToF-SIMS spectra. Drawing on the bias-variance trade-off, we investigated the performance of PLS and RF regression models as a function of training data size and inferred the amount of data needed to construct accurate and reliable regression models. In addition, we found that concatenating the positive and negative ToF-SIMS spectra improved model performance. This study provides an empirical basis for the future design of high-throughput microarrays and multicomponent systems intended for analysis with ToF-SIMS and ML.