PM2.5 pollution is a major global concern, especially in Vietnam, because of its harmful effects on health and the environment. Monitoring local PM2.5 levels is crucial for assessing air quality. However, Vietnam's state-of-the-art (SOTA) dataset, at 3 km resolution, must be refined to accurately depict spatial variation within smaller regions. In this research, we investigated machine learning-based downscaling methods to improve the spatial resolution and quality of Vietnam's existing 3 km PM2.5 products using two families of approaches: traditional machine learning models (random forest, XGBoost, CatBoost, support vector regression (SVR), and a mixed-effect model (MEM)) and deep learning models (long short-term memory (LSTM), convolutional neural network (CNN), and convolutional LSTM (ConvLSTM)). Overall, the CatBoost 2-day lag model performed best. In terms of modeling, integrating temporal factors into tree-based models can enhance predictive accuracy. Furthermore, on small datasets, traditional machine learning models outperform more complex deep learning approaches. Machine learning and deep learning models should also be validated against the PM2.5 maps they generate, because a model can score very highly in model evaluation yet produce maps that are unrealistic in application. In this study, compared with the SOTA PM2.5 maps for Vietnam and the SOTA global maps, the maps from the proposed CatBoost 2-day lag model showed a 57% increase in the correlation coefficient (Pearson R), as well as reductions of 42-73%, 28-75%, and 39-75% in root mean squared error (RMSE), mean relative error (MRE), and mean absolute error (MAE), respectively. Additionally, the daily, monthly, and annual-average maps generated by the CatBoost 2-day lag model effectively capture the spatial distribution and seasonal variations of PM2.5 in Ho Chi Minh City. These findings indicate a substantial enhancement in the accuracy and reliability of downscaled PM2.5 maps.
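
For readers who want to reproduce this kind of map comparison, the four reported metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `evaluate` and the sample values are hypothetical, and MRE is assumed here to be the mean of |predicted - observed| / observed, expressed in percent, which may differ from the definition used in the study.

```python
import math


def evaluate(observed, predicted):
    """Return Pearson R, RMSE, MAE, and MRE (%) for paired PM2.5 values.

    Hypothetical helper for illustration only. Observed values must be
    non-zero (MRE divides by them), and neither series may be constant
    (Pearson R is undefined when a variance is zero).
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n

    # Sums of products of deviations, used for the Pearson correlation.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    errors = [p - o for o, p in zip(observed, predicted)]

    return {
        "pearson_r": cov / math.sqrt(var_o * var_p),
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        # Assumed definition: mean of |error| / observed, in percent.
        "mre": 100.0 * sum(abs(e) / o for e, o in zip(errors, observed)) / n,
    }


# Made-up station observations and map values (ug/m3), for illustration only.
scores = evaluate([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
print(scores)
```

A higher Pearson R and lower RMSE, MRE, and MAE against ground observations indicate a better map, which is the direction of the improvements reported above.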