Abstract
Machine learning (ML) methods have been applied extensively to simulate air pollutant concentrations and assess individual exposure in epidemiological studies. However, little research has examined the temporal heterogeneity of ML model performance or the impact of dataset size. To explore this temporal heterogeneity when estimating daily concentrations of fine particulate matter (PM2.5) across China in 2021, we compared five decision tree-based ML models (Random Forest (RF), Categorical Boosting (CatBoost), Gradient Boost Regression Tree (GBRT), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM)) at the daily scale within three distinct timeframes. All models were evaluated using cross-validation. Model performance varied over time, and this variation was significantly correlated with PM2.5 concentration. Across the 365 days of 2021, the RF model performed best, with an annual mean R² of 0.86, a minimum of 0.84, and a maximum of 0.95. For RF, we fitted a cubic polynomial curve to the relationship between model performance and PM2.5 concentration and, based on this fit, devised a model selection strategy for different time scales. The strategy achieved an accuracy of up to 79.45%, and the selected models had an average R² of 0.85 and a maximum of 0.95. Additionally, increasing the dataset size did not significantly improve model performance; instead, it considerably lengthened runtime and increased memory usage. The methodology and findings of this study can support the development of more efficient and precise modeling approaches for air pollutant concentrations. Furthermore, this research provides a foundation for regional air pollutant governance and future health-related research.
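As an illustration of the cubic polynomial fit mentioned above, the following is a minimal sketch of a least-squares cubic relating daily model performance (R²) to daily PM2.5 concentration. The abstract does not specify the fitting procedure, so the normal-equation approach, the function names `fit_cubic` and `predict`, the rescaling of concentrations, and all data values here are assumptions made for illustration only; the numbers are synthetic and are not results from the study.

```python
def fit_cubic(x, y):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 + c3*x^3.

    Solves the 4x4 normal equations with Gaussian elimination.
    Hypothetical helper, not the authors' implementation.
    """
    n = 4
    # Normal equations: A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i)
    A = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(n)]
    # Forward elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted cubic at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Synthetic example (assumed values, NOT data from the study).
# Concentrations are rescaled to units of 100 ug/m3 to keep the
# normal equations well conditioned.
true_coeffs = [0.80, 0.30, -0.25, 0.05]
pm25_scaled = [0.1 * k for k in range(1, 16)]               # 10 to 150 ug/m3
daily_r2 = [predict(true_coeffs, c) for c in pm25_scaled]   # noise-free toy R² values

coeffs = fit_cubic(pm25_scaled, daily_r2)
expected_r2_at_60 = predict(coeffs, 0.6)  # fitted R² at 60 ug/m3
```

A fitted curve of this kind gives an expected performance for a given concentration level, which is the sort of quantity a concentration-based model selection rule could compare across models; how the study actually uses the curve is not stated in the abstract.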