The increasing pollution of aquifers by human activities over recent decades poses a threat to drinking water safety. While Gaussian Process Regression (GPR) is a robust tool for predicting and monitoring water quality, its effectiveness is hindered by the limited data available for model training and validation, known as the “small sample problem”. One approach to this problem is virtual sample generation (VSG). This study aimed to increase the accuracy of GPR for predicting water quality when datasets are limited. Three VSG methods, namely Multi Distribution Mega-Trend Diffusion (MD-MTD), Generative Adversarial Network (GAN), and t-distributed stochastic neighbor embedding (t-SNE), were compared for their ability to improve GPR predictions of strontium (Sr²⁺) in the shallow aquifer system in Songyuan, Jilin Province. The results showed that t-SNE provided the largest improvement in GPR accuracy, with R² increasing from 0.86 to 0.99 (12.98 %), followed by MD-MTD (R² of 0.95, 9.39 %), with the smallest improvement obtained by GAN (R² of 0.92, 5.98 %). Boxplots show that MD-MTD-GPR predictions do not fully capture the observed data distribution. GAN accurately replicates the data distribution, while t-SNE-GPR achieves the highest prediction accuracy and handles data fluctuations. GPR accuracy improved as the number of virtual samples increased but tended to decrease once the number exceeded 258 in this study. These results can guide efforts to improve GPR accuracy when datasets are limited and can help improve water quality management and drinking water safety in regions with sparse monitoring data.
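The accuracy comparison above rests on the coefficient of determination (R²). As a minimal sketch of how that metric is computed and compared between a baseline model and one trained with virtual samples, the snippet below uses the standard definition 1 - SS_res / SS_tot. The function name `r_squared` and all numeric values are hypothetical and chosen for illustration only; they are not data or code from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error of the model predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviation of observations from their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only; they are NOT data from the study.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
baseline_pred = [1.5, 1.5, 3.5, 3.5, 5.5]   # hypothetical model trained on few samples
augmented_pred = [1.1, 1.9, 3.1, 3.9, 5.1]  # hypothetical model trained with virtual samples

r2_base = r_squared(observed, baseline_pred)
r2_aug = r_squared(observed, augmented_pred)
print(f"baseline R2 = {r2_base:.3f}, augmented R2 = {r2_aug:.3f}")
```

A model that always predicts the mean of the observations scores 0 under this definition, and a perfect model scores 1, which is why values such as 0.86 and 0.99 can be compared directly across the VSG methods.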