The advantage of mid-infrared spectrometry for milk analysis in breeding lies in its ability to generate millions of records quickly. However, predictions for these records may be biased if the calibration process does not account for their spectral variability when constructing the predictive model. This study therefore introduces a novel method for developing a World Representative Spectral Database (WRSD) to reduce the risk of spectral extrapolation when predicting dairy traits in new samples. The method uses a 2-phase selection procedure that is efficient and minimizes memory usage. In the first phase, a decomposition matrix is generated via Principal Component Analysis (PCA) on a data set of 2,324,443 records. In the second phase, spectra are selected iteratively based on a location index derived from the PCA scores, and the occurrence frequency of spectra is calculated to refine the barycenter estimates. The barycenter of the selected spectra closely matches that of the entire data set, demonstrating that just 3 principal components (PCs) are sufficient. Applying the procedure to 4 varied data sets totaling over 21 million records, we select 583,440 spectra to represent spectral diversity, with selection rates between 2.00% and 14.88%. This range of selection rates illustrates the spectral variability across different dairy populations and data providers. The utility of the WRSD for algorithm developers is demonstrated with a hypothetical calibration set of 71 samples. This calibration set covers between 91.42% and 98.50% of the WRSD variability, except for the Irish data set (3.50%), indicating that additional samples are needed to accurately represent Irish variability and minimize spectral extrapolation. This study offers valuable insights into the representativeness of training sets for capturing spectral variability within targeted dairy populations. 
Although the current WRSD does not fully encompass global milk spectral diversity, its development underscores the importance of gathering more data and standardizing spectral information across spectrometer brands. Ultimately, the WRSD is valuable not only for trait prediction but also for identifying abnormal milk samples, making it highly relevant to dairy science and technology.
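The 2-phase idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the location index, so the sketch assumes it is the integer grid cell obtained by dividing each PCA score by a hypothetical `cell_size`. The function names, the random stand-in data, and the cell-based definition of coverage are likewise assumptions made for illustration only.

```python
import numpy as np

def fit_pca(spectra, n_components=3):
    """Phase 1: centre the spectra and derive PCA loadings via SVD."""
    mean = spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:n_components].T  # loadings: (n_wavenumbers, n_components)

def location_index(spectra, mean, loadings, cell_size):
    """Project onto the PCs and map each score vector to an integer grid cell
    (assumed definition of the location index)."""
    scores = (spectra - mean) @ loadings
    return scores, np.floor(scores / cell_size).astype(np.int64)

def select_representatives(cells):
    """Phase 2: keep the first spectrum in each occupied cell and count how
    many spectra fall in that cell (its occurrence frequency)."""
    _, first_idx, counts = np.unique(
        cells, axis=0, return_index=True, return_counts=True
    )
    return first_idx, counts

def coverage(reference_cells, candidate_cells):
    """Share of reference cells that contain at least one candidate spectrum."""
    ref = {tuple(c) for c in reference_cells}
    cand = {tuple(c) for c in candidate_cells}
    return len(ref & cand) / len(ref)

# Random stand-in data: 5,000 "spectra" with 50 "wavenumbers" (not real milk spectra).
rng = np.random.default_rng(0)
spectra = rng.normal(size=(5000, 50))
cell_size = 0.5  # hypothetical grid resolution

mean, loadings = fit_pca(spectra, n_components=3)
scores, cells = location_index(spectra, mean, loadings, cell_size)
selected, counts = select_representatives(cells)

# Barycenter of the full set versus the frequency-weighted barycenter of the selection.
full_barycenter = scores.mean(axis=0)
selected_barycenter = np.average(scores[selected], axis=0, weights=counts)

# Coverage of the selected database by a small hypothetical calibration set.
calibration_idx = rng.choice(len(spectra), size=71, replace=False)
calibration_coverage = coverage(cells[selected], cells[calibration_idx])

print(f"selection rate: {len(selected) / len(spectra):.2%}")
print("barycenter gap:", np.abs(full_barycenter - selected_barycenter).max())
print(f"calibration coverage: {calibration_coverage:.2%}")
```

In this sketch, weighting each selected spectrum by the number of spectra in its cell keeps the selected barycenter within one cell width of the full barycenter on every PC. A coarser `cell_size` selects fewer spectra at the cost of a looser match.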