Abstract

Resolving the spatial variability of ambient air pollutants and quantifying contributing factors are critical to human exposure assessment and effective pollution control. Data-driven techniques have been employed in air quality modeling because they capture complex relationships in data and are fast and easy to implement. In this study, we addressed two issues concerning model evaluation and interpretability by applying two common data-driven approaches, linear regression (LR) and random forest (RF), with potentially predictive land-use variables to predict spatial variations of air pollution in an urban setting. The data came from measurements of ambient nitrogen dioxide (NO2) concentrations in the Greater Vancouver Regional District in Canada. First, we showed that model performance is sensitive to the division of training and test sets. Applying a limited number of hold-out validations or cross-validations and reporting the mean model metrics cannot capture this variability or fairly evaluate model performance. We proposed repeated cross-validation (RCV) as a reliable evaluation method that accounts for both the mean and the variance of performance. Second, there is no consistent approach to measuring the importance of predictor variables and quantifying their contributions across different types of data-driven models. Traditional approaches only reflect the relative importance of predictor variables in terms of predictive power, without quantifying their contributions to the model output. We proposed applying SHapley Additive exPlanations (SHAP), an explanation method rooted in coalitional game theory, as a unifying framework to interpret and compare different types of data-driven methods. We showed that SHAP is capable of 1) calculating each predictor variable's contribution to each data point and 2) ranking the importance of predictor variables in terms of their contributions to the model output.
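The repeated cross-validation (RCV) idea described above can be sketched with scikit-learn's `RepeatedKFold`: repeating k-fold splits yields a distribution of scores, so both the mean and the variance of model performance can be reported. This is a minimal illustration, not the study's actual pipeline; the synthetic data, predictor count, and hyperparameters below are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the NO2 data: 200 monitoring sites, 5 land-use predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Repeated 5-fold CV: 10 repeats give 50 R^2 scores per model, a distribution
# rather than a single hold-out number.
rcv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in [("LR", LinearRegression()),
                    ("RF", RandomForestRegressor(n_estimators=50, random_state=0))]:
    scores = cross_val_score(model, X, y, cv=rcv, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}, sd = {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean is what distinguishes this from a single train/test split, whose score depends heavily on which points land in the test set.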
The results indicated that different models may favor different predictor variables and thus yield different interpretations.
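The Shapley-value idea underlying SHAP can be illustrated without the `shap` package by computing exact Shapley values by brute force for a toy model. This is a sketch of the game-theoretic definition, not the authors' method: "absent" features are replaced by a baseline (here, zeros standing in for feature means), which is one common convention in SHAP-style explanations.

```python
import itertools
import math

import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one data point.

    Each feature's value is the weighted average, over all feature subsets S,
    of the change in prediction when that feature is added to S. Features not
    in S are held at the baseline.
    """
    n = len(x)
    phi = np.zeros(n)

    def f(subset):
        z = baseline.copy()
        z[list(subset)] = x[list(subset)]
        return predict(z.reshape(1, -1))[0]

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[i] += w * (f(S + (i,)) - f(S))
    return phi

# Toy linear model, so the contributions are easy to verify by hand:
# phi_i = w_i * (x_i - baseline_i).
w = np.array([2.0, -1.0, 0.5])
predict = lambda X: X @ w
baseline = np.zeros(3)
x = np.array([1.0, 2.0, 3.0])
phi = shapley_values(predict, x, baseline)

# Local accuracy: per-point contributions sum to the prediction minus the
# baseline prediction, which is what makes ranking and aggregation meaningful.
print(phi, phi.sum())
```

The brute-force sum is exponential in the number of features; in practice the `shap` library uses model-specific approximations (e.g. for tree ensembles), but the additivity property demonstrated here is the same.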