Towards better process management in wastewater treatment plants: Process analytics based on SHAP values for tree-based machine learning methods

Dong Wang, Sven Thunéll, Ulrika Lindberg, Lili Jiang, Johan Trygg, Mats Tysklind

doi:10.1016/j.jenvman.2021.113941

Open Access

https://doi.org/10.1016/j.jenvman.2021.113941

Journal: Journal of Environmental Management
Publication Date: Oct 15, 2021
Citations: 115
License type: CC BY

Affiliation: Umeå University

Abstract

Understanding the mechanisms of pollutant removal in Wastewater Treatment Plants (WWTPs) is crucial for controlling effluent quality efficiently. However, the numerous treatment units and operational factors, and the underlying interactions between them, usually obscure a comprehensive and precise understanding of the processes. We have previously proposed a machine learning (ML) framework to uncover complex cause-and-effect relationships in WWTPs. However, only one interpretable ML model, Random Forest (RF), was studied, and the interpretation method was not granular enough to reveal detailed relationships between operational factors and effluent parameters. In this paper, we therefore present an upgraded framework involving three interpretable tree-based models (RF, XGBoost, and LightGBM), three metrics (R², root mean squared error (RMSE), and mean absolute error (MAE)), and a more advanced interpretation system, SHapley Additive exPlanations (SHAP). Details of the framework are provided along with a demonstration of its practical applicability based on a case study of the Umeå WWTP in Sweden. Results show that, for both labels, TSSe (total suspended solids in effluent) and PO4e (phosphate in effluent), the XGBoost models perform best whereas the RF models perform worst, due to overfitting and polarized fitting. This study has yielded multiple new and significant findings with respect to the control of TSSe and PO4e in the Umeå WWTP and other similarly configured WWTPs. Additionally, this study has produced two important generic findings relating to ML applications for cause-and-effect investigations in WWTPs (or even other process industries). First, the model comparison should be carried out from multiple perspectives to ensure that underlying details are fully revealed and examined. Second, using a precise, robust, and granular explanation method (one that provides feature attributions for individual instances) can bring extra insight into both model comparison and model interpretation. SHAP is recommended, as we found it to be of great value in this study.
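To make the idea of a granular, per-instance feature attribution concrete, the following minimal sketch computes exact Shapley values for a single input by brute-force enumeration of feature subsets. It is an illustration under stated assumptions, not the procedure used in the paper: `toy_model`, the instance `x`, and the all-zero `baseline` are hypothetical, and "absent" features are simply set to a fixed baseline value, which is a simplification of how SHAP treats missing features for tree models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x, relative to a baseline input.

    Assumption: a feature that is "absent" from a coalition takes its
    baseline value. This is a simplification for illustration only.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` take the instance's value; the rest stay at baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                coalition = set(s)
                # Marginal contribution of feature i when added to this coalition.
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]          # hypothetical instance to explain
baseline = [0.0, 0.0, 0.0]   # hypothetical reference input
phi = shapley_values(toy_model, x, baseline)

for i, p in enumerate(phi):
    print(f"feature {i}: attribution {p:.3f}")
print("sum of attributions:", sum(phi))
print("prediction minus baseline prediction:", toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that lets each individual prediction be decomposed feature by feature. Brute-force enumeration scales exponentially with the number of features, so it is only practical for toy cases like this one.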

Highlights

Wastewater Treatment Plants (WWTPs) are systems used to remove various pollutants from collected wastewater to ensure the effluent discharged into the water cycle complies with the regulations and has minimal influence on the environment (Russell, 2019).
Gradient boosting decision tree (GBDT) (Friedman, 2001) is an ensemble method in machine learning where multiple weak learners are combined to form a single strong learner. It differs from bagging methods and is characterized by a sequential, iterative learning process in which the current regression tree is trained using the residuals from the previous tree.
In the hyperparameter tuning process, the models' generalization performance (i.e., preventing overfitting) was prioritized. This was to ensure that the results of the SHapley Additive exPlanations (SHAP) interpretation carried out on the training data were applicable to future unseen data.
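The sequential residual-fitting idea behind GBDT described in the highlights above can be sketched in a few lines. This is a minimal illustration assuming a squared-error loss, a single input feature, and depth-1 trees (stumps); the data are synthetic, and the names `fit_stump`, `fit_gbdt`, `predict_gbdt`, and the parameter values are hypothetical. It is not the XGBoost or LightGBM implementation, both of which add regularization and many optimizations.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: choose the threshold that minimises squared error."""
    best = None
    # Excluding the largest value guarantees a non-empty right-hand side.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, n_trees=50, learning_rate=0.1):
    """Each new stump is trained on the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)           # initial prediction: the mean of the labels
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(rmse(ys, preds))  # training RMSE after adding this stump
    return base, stumps, history


def predict_gbdt(base, stumps, x, learning_rate=0.1):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Synthetic data for illustration only (not plant data).
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

base, stumps, history = fit_gbdt(xs, ys)
print("training RMSE after first stump:", round(history[0], 3))
print("training RMSE after last stump:", round(history[-1], 3))
```

The training error only ever goes down as stumps are added, which is why training error alone says little about overfitting. Checking the same metrics on held-out data is one common way to assess generalization when tuning values such as `n_trees` and `learning_rate`.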

Summary

Introduction

Wastewater Treatment Plants (WWTPs) are systems used to remove various pollutants from collected wastewater to ensure the effluent discharged into the water cycle complies with the regulations and has minimal influence on the environment (Russell, 2019). In a recently published paper (Wang et al., 2021), based on online monitored process data, we proposed a novel machine learning (ML) framework that can be used to uncover the precise relationship between operational factors and effluent quality. In this framework, a Deep Neural Network (DNN) model (Sugiyama, 2019) was used to validate whether the Random Forest (RF) model (Breiman, 2001, 2002) captured sufficient variance to support the subsequent RF model interpretation: Variable Importance Measure (VIM) analysis and Partial Dependence Plot (PDP) analysis (Friedman, 2001). VIM was carried out to identify the operational factors with the greatest influence on effluent quality, and PDP was carried out to investigate how those influential factors affect effluent quality. This framework, and its case study on the local Umeå WWTP in Sweden, helped in the development of a more advanced control strategy to optimize the usage of chemicals and energy without compromising effluent quality. The expansion of tree-based models and the adoption of the SHAP system are justified both theoretically and through the case study details, to serve as a reference for studies that aim to understand WWTP processes through ML implementation.

Processes and data sources in the Umeå WWTP

Data transformation

XGBoost and LightGBM

Framework of study

TSSe models

PO4e models

Significance of study

Conclusions

Declaration of competing interest


