In epidemiological research, accurate estimation of historical ground-level ozone (O3) concentrations at high spatiotemporal resolution is crucial for effective exposure assessment. The current state of the art for estimating air pollutant concentrations is a two-stage ensemble method that integrates outputs from multiple machine learning algorithms. Despite its effectiveness, this approach can be refined for more precise O3 estimation. In this study, we propose an enhanced ensemble method that incorporates four key strategies. First, we employ high-resolution spatiotemporal predictors derived from prior machine learning studies for refined secondary learning. Second, we use sophisticated algorithms, including categorical gradient boosting, deep neural network, random forest, stochastic variational Gaussian process, transformer, and a combination of convolutional neural network and long short-term memory neural network, as sublearners to enhance learning capabilities. Third, we split the sample set spatiotemporally and train submodels separately on each subset to eliminate unobserved spatiotemporal heterogeneity. Finally, we integrate the sublearner predictions with a complex machine learning algorithm rather than a generalized additive model, which allows the ensemble to capture intricate nonlinear relationships beyond basic spatiotemporal linear weights. To validate these improvements, we estimated daily maximum 8-h moving average O3 concentrations ([O3]MDA8) across the Chinese mainland from 2013 to 2020 at a 1 km spatial resolution. The proposed method demonstrated notable accuracy, achieving an out-of-station coefficient of determination (R2) of 0.943 and a root-mean-square error (RMSE) of 10.197 μg/m3. This performance marks a nearly 15% improvement over the best existing Chinese O3 exposure model based on a single algorithm and also surpasses previous studies that used traditional ensemble methods for other air pollutants. Our enhanced ensemble approach significantly bolsters the reliability and robustness of future environmental epidemiological studies by further mitigating “misclassification” errors.
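The out-of-station R2 and RMSE reported above can be illustrated with a minimal sketch. It assumes that "out-of-station" means every record from a held-out monitoring station is excluded from training, which is a common reading but is not spelled out in the abstract. The single-predictor least-squares model, the two-fold split, and the toy records are hypothetical placeholders, not the study's sublearners or data.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def station_folds(station_ids, n_folds):
    """Give all records of one station the same fold index, so a held-out
    fold only contains stations that were never seen during training."""
    unique = sorted(set(station_ids))
    fold_of = {station: i % n_folds for i, station in enumerate(unique)}
    return [fold_of[station] for station in station_ids]


def fit_line(xs, ys):
    """Placeholder model (assumption): least squares on one predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def out_of_station_scores(records, n_folds=2):
    """records: list of (station_id, predictor, observed_value) tuples.
    Returns (R2, RMSE) computed on predictions for held-out stations."""
    folds = station_folds([r[0] for r in records], n_folds)
    predicted = [0.0] * len(records)
    for k in range(n_folds):
        train = [r for r, f in zip(records, folds) if f != k]
        model = fit_line([r[1] for r in train], [r[2] for r in train])
        for i, (record, f) in enumerate(zip(records, folds)):
            if f == k:
                predicted[i] = model(record[1])
    observed = [r[2] for r in records]
    return r_squared(observed, predicted), rmse(observed, predicted)


# Toy records (hypothetical): four stations, two observations each.
toy_records = [
    ("A", 1.0, 12.0), ("A", 2.0, 14.0),
    ("B", 3.0, 16.0), ("B", 4.0, 18.0),
    ("C", 5.0, 20.0), ("C", 6.0, 22.0),
    ("D", 7.0, 24.0), ("D", 8.0, 26.0),
]
toy_r2, toy_rmse = out_of_station_scores(toy_records)
```

Splitting by station rather than by record keeps observations from the same monitor out of both the training and test sets, so the scores reflect predictive skill at unmonitored locations rather than at sites the model has already seen.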