Evaluation of water quality indexes with novel machine learning and SHapley Additive ExPlanation (SHAP) approaches

Majid Khan, Mujahid Ali, Ali Aldrees, Abubakr Taha Bakheit Taha

doi:10.1016/j.jwpe.2024.104789

Abstract

Water quality indexes (WQI) are pivotal in assessing aquatic systems. Conventional modeling approaches rely on extensive datasets with numerous unspecified inputs, which makes WQI assessment time-consuming. Numerous studies have applied machine learning (ML) methods to WQI analysis, but the resulting models often lack interpretability. To address this issue, this study developed five interpretable predictive models for estimating electrical conductivity (EC) and total dissolved solids (TDS): two gene expression programming (GEP) models, two deep neural network (DNN) models, and one optimizable Gaussian process regressor (OGPR) model. For model development, a total of 372 monthly records were collected at two outlet stations in the Upper Indus River. The efficacy and accuracy of the models were assessed using several statistical measures, namely the correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE), together with 5-fold cross-validation. The DNN2 model outperformed the other four models, exhibiting R-values closer to 1.0 for both EC and TDS. In contrast, the genetic programming-based models, GEP1 and GEP2, exhibited comparatively lower accuracy in predicting the water quality indexes. The SHapley Additive exPlanation (SHAP) analysis revealed that bicarbonate, calcium, and sulphate jointly contribute approximately 78 % to EC, while sodium, bicarbonate, calcium, and magnesium together account for around 87 % of TDS in water. Notably, pH and chloride had minimal influence on both water quality indexes. In conclusion, the study highlights the cost-effective and practical potential of predictive models for EC and TDS in assessing and monitoring river water quality.
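As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch in plain Python of the standard textbook definitions of R, MAE, and RMSE, plus a simple 5-fold index split. The function names and the unshuffled, contiguous fold layout are assumptions made for this example; the abstract does not state how the authors implemented their metrics or partitioned the 372 records.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (R) between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Assumption: no shuffling; the first (n_samples % k) folds get one
    extra sample so that every record is tested exactly once.
    """
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical observed and predicted EC values, for illustration only.
observed = [210.0, 250.0, 300.0, 280.0]
predicted = [205.0, 260.0, 295.0, 285.0]

r_value = pearson_r(observed, predicted)
mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)

# Split 372 records (the dataset size stated in the abstract) into 5 folds.
folds = list(k_fold_indices(372, k=5))
```

In a cross-validation loop, a model would be fitted on each `train_idx` subset and scored with these three measures on the matching `test_idx` subset, and the fold scores would then be averaged.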

Full Text