Elevating hourly PM2.5 forecasting in Istanbul, Türkiye: Leveraging ERA5 reanalysis and genetic algorithms in a comparative machine learning model analysis

Serdar Gündoğdu, Tolga Elbir

doi:10.1016/j.chemosphere.2024.143096

Abstract

Rapid urbanization and industrialization have intensified air pollution, posing severe health risks and necessitating accurate PM2.5 predictions for effective urban air quality management. This study distinguishes itself by utilizing high-resolution ERA5 reanalysis data for a grid-based spatial analysis of Istanbul, Türkiye, a densely populated city with diverse pollutant sources. It assesses the predictive accuracy of five machine learning (ML) models: Multiple Linear Regression (MLR), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting (LGB), Random Forest (RF), and Nonlinear Autoregressive with Exogenous Inputs (NARX). Notably, it introduces genetic algorithm optimization for the NARX model to enhance its performance. The models were trained on hourly PM2.5 concentrations from twenty monitoring stations across 2020–2021. Istanbul was divided into seven regions based on ERA5 grid distributions to examine PM2.5 spatial variability. Seventeen input variables from ERA5, including meteorological, land cover, and vegetation parameters, were analyzed using the Neighborhood Component Analysis (NCA) method to identify the most predictive variables. Comparative analysis showed that while all models provided valuable insights (RF > LGB > XGBoost > MLR), the NARX model outperformed them, particularly with the complex dataset used. The NARX model achieved a high R-value (0.89), low RMSE (5.24 μg/m³), and low MAE (2.94 μg/m³). It performed best in autumn and winter, with the highest accuracy in Region-1 (R-value 0.94) and the lowest in Region-5 (R-value 0.75). This study's success in a complex urban setting with limited monitoring underscores the robustness of the NARX model and the methodology's potential for global application in similar urban contexts. By addressing temporal and spatial variability in air quality predictions, this research sets a new benchmark and highlights the importance of advanced data analysis techniques for developing targeted pollution control strategies and public health policies.
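
The three scores quoted in the abstract (R-value, RMSE, MAE) can be illustrated with a minimal sketch. This is not the authors' code: the function name `evaluate` and the sample concentrations are hypothetical, and R is assumed here to be the Pearson correlation coefficient between observed and predicted hourly PM2.5.

```python
import numpy as np


def evaluate(observed, predicted):
    """Return R (Pearson correlation, assumed), RMSE and MAE for two series."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = pred - obs
    r = float(np.corrcoef(obs, pred)[0, 1])    # linear correlation between series
    rmse = float(np.sqrt(np.mean(err ** 2)))   # root mean squared error, in μg/m³
    mae = float(np.mean(np.abs(err)))          # mean absolute error, in μg/m³
    return {"R": r, "RMSE": rmse, "MAE": mae}


# Made-up hourly PM2.5 values (μg/m³), for illustration only.
obs = [12.0, 18.5, 25.0, 30.5, 22.0]
pred = [13.0, 17.5, 27.0, 28.5, 22.0]
scores = evaluate(obs, pred)
print(scores)
```

RMSE penalizes large misses more heavily than MAE, which is one reason the two are usually reported together.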

Full Text