Machine-learning models to replicate large-eddy simulations of air pollutant concentrations along boulevard-type streets

Moritz Lange, Henri Suominen, Emilia Oikarinen, Kai Puolamäki, Leena Järvi, Rafael Savvides, Mona Kurppa

doi:10.5194/gmd-14-7411-2021


Open Access

https://doi.org/10.5194/gmd-14-7411-2021


Journal: Geoscientific Model Development	Publication Date: Dec 2, 2021
Citations: 4	License type: CC BY 4.0

Affiliation: University of Helsinki

Abstract

Running large-eddy simulations (LESs) can be burdensome and computationally too expensive for applications such as supporting urban planning. In this study, regression models are used to replicate modelled air pollutant concentrations from LES in urban boulevards. We study the performance of regression models and discuss how to detect situations where the models are applied outside their training domain and their outputs cannot be trusted. Regression models from 10 different model families are trained, and a cross-validation methodology is used to evaluate their performance and to find the best set of features needed to reproduce the LES outputs. We also test the regression models on an independent testing dataset. Our results suggest that, in general, log-linear regression gives the best and most robust performance on new independent data. It clearly outperforms the dummy model, which would predict constant concentrations for all locations (multiplicative minimum RMSE (mRMSE) of 0.76 vs. 1.78 for the dummy model). Furthermore, we demonstrate that it is possible to detect concept drift, i.e. situations where the model is applied outside its training domain and a new LES run may be necessary to obtain reliable results. Regression models can be used to replace LES in estimating air pollutant concentrations, unless higher accuracy is needed. To obtain reliable results, it is, however, important to carry out model and feature selection carefully to avoid overfitting and to use methods that detect concept drift.
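The abstract describes concept drift as a model being applied outside its training domain. The sketch below illustrates that idea with a deliberately simple stand-in, a per-feature range check. It is not the detection method used in the study, and the feature values (wind speed, building height) and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FeatureRange:
    """Smallest and largest value of one feature seen during training."""
    low: float
    high: float


def fit_ranges(train_rows: Sequence[Sequence[float]]) -> List[FeatureRange]:
    """Record the per-feature range covered by the training data."""
    columns = list(zip(*train_rows))
    return [FeatureRange(min(col), max(col)) for col in columns]


def drift_fraction(ranges: List[FeatureRange],
                   new_rows: Sequence[Sequence[float]],
                   tolerance: float = 0.0) -> float:
    """Fraction of new rows with at least one feature outside the training range.

    `tolerance` widens each range by that fraction of its width.
    """
    def outside(row: Sequence[float]) -> bool:
        for value, r in zip(row, ranges):
            margin = tolerance * (r.high - r.low)
            if value < r.low - margin or value > r.high + margin:
                return True
        return False

    flagged = sum(1 for row in new_rows if outside(row))
    return flagged / len(new_rows)


# Hypothetical rows: (wind speed in m/s, building height in m).
train = [(2.0, 10.0), (3.0, 15.0), (4.0, 20.0), (5.0, 25.0)]
similar = [(2.5, 12.0), (4.5, 22.0)]                # inside the training ranges
shifted = [(8.0, 12.0), (4.5, 40.0), (3.0, 18.0)]   # two rows fall outside

ranges = fit_ranges(train)
similar_drift = drift_fraction(ranges, similar)
shifted_drift = drift_fraction(ranges, shifted)
```

A high fraction would suggest that the regression output should not be trusted without a new LES run. A range check only looks at the inputs, so it cannot tell whether the model's error has actually grown.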

Highlights

Exposure to ambient air pollution leads to cardiovascular and pulmonary diseases, and is estimated to cause 3 million premature deaths worldwide every year (Lelieveld et al, 2015; WHO, 2016), of which 0.8 million occur in Europe (Lelieveld et al, 2019)
Most of the previous studies on developing a statistical air pollution model using machine learning have been based on field measurements, and the spatiotemporal distribution of pollutants has been assessed by utilizing multiple stationary sites in model training (e.g. Araki et al, 2018; Yang et al, 2018)
Model families and the R packages or functions used to fit them:
Hierarchical model separating data based on rules (rpart)
A Bayesian kernel-based method for regression (kernlab)
Ensemble method of decision trees (xgboost)
Ordinary least squares linear regression (lm)
Linear regression modelling log(pc + 1) (lm)
Linear regression assuming Poisson distributed data (glm)
Ensemble method of decision trees (randomForest)
Non-linear kernel-based regression method (e1071)
Support vector regression modelling log(pc + 1) (e1071)
Combination of logistic regression and Poisson regression (pscl)
Random forest is an ensemble method that aggregates the predictions of multiple decision trees trained on random subsets of data and features.
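Of the model families listed above, log-linear regression fits an ordinary linear model to log(pc + 1), where pc is presumably the pollutant concentration. The following minimal sketch compares such a fit with a constant-prediction dummy model. It assumes a single hypothetical feature (distance from the traffic lane), uses synthetic data, and scores both models with plain RMSE on the log scale rather than the mRMSE reported in the study.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(mean((p - o) ** 2 for p, o in zip(predicted, observed)))


# Synthetic, hypothetical data: concentration decays with distance,
# with a small deterministic perturbation standing in for noise.
distance = [float(d) for d in range(10)]
pc = [50.0 * math.exp(-0.3 * d) * (1.0 + 0.05 * (-1) ** int(d)) for d in distance]

# Log-linear regression: fit a straight line to log(pc + 1).
log_target = [math.log(p + 1.0) for p in pc]
intercept, slope = fit_line(distance, log_target)
loglinear_pred = [intercept + slope * d for d in distance]

# Dummy model: the same constant prediction at every location.
dummy_pred = [mean(log_target)] * len(log_target)

rmse_loglinear = rmse(loglinear_pred, log_target)
rmse_dummy = rmse(dummy_pred, log_target)

# Back-transform to concentrations when predictions are needed in original units.
pc_pred = [math.exp(v) - 1.0 for v in loglinear_pred]
```

The errors here are computed on the training data only to keep the sketch short. The study evaluates models with cross-validation and an independent test dataset, which is what guards against overfitting.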

Summary

Introduction

Exposure to ambient air pollution leads to cardiovascular and pulmonary diseases, and is estimated to cause 3 million premature deaths worldwide every year (Lelieveld et al, 2015; WHO, 2016), of which 0.8 million occur in Europe (Lelieveld et al, 2019). In contrast to computational fluid dynamics (CFD), statistical models based on machine learning may offer a significantly less expensive alternative to predict urban air quality and pollutant dispersion. The number of studies conducting machine-learning-based air quality modelling has increased rapidly (Rybarczyk and Zalakeviciute, 2018). Machine learning allows finding a relationship between a target variable, e.g. the concentration of air pollutants in a certain location, and its predictors, which are often called features. Most of the previous studies on developing a statistical air pollution model using machine learning have been based on field measurements, and the spatiotemporal distribution of pollutants has been assessed by utilizing multiple stationary sites in model training (e.g. Araki et al, 2018; Yang et al, 2018). This study investigates the application of machine learning for emulating LES outputs of local-scale air pollutant dispersion in urban areas.
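Which features enter the relationship between target and predictors matters, and the abstract reports that cross-validation was used to find the best feature set. As a rough illustration, the sketch below runs greedy forward feature selection with k-fold cross-validation. It assumes ordinary least squares as the model and mean squared error as the score. The data are synthetic, and the study's own models, folds and performance measure are not reproduced.

```python
def solve(a, b):
    """Solve the linear system a w = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least squares with an intercept, solved through the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(w, row):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))


def cv_error(rows, y, features, folds=4):
    """Cross-validated mean squared error using only the given feature indices."""
    def pick(i):
        return [rows[i][j] for j in features]

    squared = 0.0
    for f in range(folds):
        train = [i for i in range(len(y)) if i % folds != f]
        test = [i for i in range(len(y)) if i % folds == f]
        w = fit_ols([pick(i) for i in train], [y[i] for i in train])
        squared += sum((predict(w, pick(i)) - y[i]) ** 2 for i in test)
    return squared / len(y)


def forward_select(rows, y, max_features):
    """Add the feature that most lowers the CV error until nothing improves."""
    selected = []
    best = cv_error(rows, y, selected)
    while len(selected) < max_features:
        candidates = [j for j in range(len(rows[0])) if j not in selected]
        if not candidates:
            break
        scores = {j: cv_error(rows, y, selected + [j]) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected


# Synthetic data: features 0 and 1 drive the target, feature 2 is unrelated.
rows = [(float(i), float((i * 5) % 7), float((i * 3) % 5)) for i in range(12)]
y = [2.0 * r[0] + 1.0 * r[1] + 0.1 * ((i * 7) % 3 - 1) for i, r in enumerate(rows)]

selected = forward_select(rows, y, max_features=3)
```

Stopping as soon as the cross-validated error no longer improves is one common way to limit overfitting during feature selection. Other stopping rules are possible.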

Methods and material

Large-eddy simulation datasets

Data pre-processing

Target variable

Features

Forward feature selection

Model descriptions

Performance measure

Experiments

Model training

Model selection

Model evaluation

Concept drift detection

Findings

Discussion and conclusions
