Abstract

Accurate ozone (O3) prediction is crucial for assessing its impact on public health and for developing effective prevention and control measures. Ground-based observations are considered highly accurate, but their limited and uneven spatial coverage poses challenges. Air quality models can provide complete spatiotemporal coverage but are subject to biases due to simplifications in physicochemical mechanisms and uncertainties in emission and meteorological inputs. In this study, we aimed to improve the accuracy of O3 concentrations predicted by the Community Multiscale Air Quality (CMAQ) model using machine learning techniques. First, we compared three machine learning algorithms: Light Gradient Boosting Machine (LightGBM), Random Forest, and eXtreme Gradient Boosting. LightGBM achieved the highest correlation coefficient (R) of 0.84 and was therefore selected for further analysis. Subsequently, two multi-source data prediction models based on LightGBM were constructed to improve the accuracy of the predicted daily maximum 8-h O3 (O3-Max8h). The first model, referred to as LGBR, used the CMAQ-predicted air pollutant concentrations and meteorological fields as input variables, while the second model, named LGBR-CHAP, incorporated ChinaHighAirPollutants (CHAP) O3-Max8h as an additional input variable. Validation results demonstrated significant improvements in the LGBR and LGBR-CHAP models compared to the original CMAQ model. On the daily scale, the root mean square error and mean bias of the predicted O3-Max8h decreased by 3.15 and 2.07 μg/m3, respectively, for the LGBR model, and by 5.61 and 4.18 μg/m3, respectively, for the LGBR-CHAP model. On the monthly scale, the R of the original CMAQ model varied from 0.2 in March to 0.91 in June; LGBR improved R in all months (0.4–0.92), as did LGBR-CHAP (0.5–0.94). Spatially, the original CMAQ model simulated O3-Max8h more skillfully in eastern China than in western China. After optimization by the LGBR and LGBR-CHAP models, the national average R improved from 0.77 to 0.83 and 0.88, respectively. The LGBR-CHAP model exhibited better predictive capacity than the LGBR model and was subsequently employed to generate high-resolution (10 km × 10 km), full-coverage (100%) O3-Max8h data. This dataset will prove valuable for future air pollution and health studies.
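
The three validation statistics quoted above (R, root mean square error, and mean bias) can be illustrated with a minimal sketch. It is not the study's code: the function names are hypothetical, the sample values are invented for illustration, and mean bias is assumed to be defined as predicted minus observed.

```python
import math


def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def mean_bias(pred, obs):
    """Mean bias, assumed here to be mean(predicted - observed)."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)


def pearson_r(pred, obs):
    """Pearson correlation coefficient R between two equal-length series."""
    n = len(obs)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


# Invented O3-Max8h values in ug/m3 (illustrative only, not from the study).
observed = [80.0, 100.0, 120.0, 140.0]
raw_model = [95.0, 110.0, 140.0, 150.0]   # stands in for uncorrected model output
corrected = [82.0, 99.0, 123.0, 138.0]    # stands in for bias-corrected output

for name, series in (("raw", raw_model), ("corrected", corrected)):
    print(name, pearson_r(series, observed), rmse(series, observed),
          mean_bias(series, observed))
```

A reduction in RMSE or mean bias between the two series corresponds to the kind of "decreased by" figures reported for the LGBR and LGBR-CHAP models.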

