ABSTRACT Water is essential to the survival of all life forms, yet it is continually at risk of contamination. Accurate water quality prediction is therefore important for protecting ecosystem health. This study assesses the effectiveness of five ensemble learning techniques, namely AdaBoost, gradient boosting, XGBoost, CatBoost, and LightGBM, in predicting water quality parameters in the Bara River Basin, Pakistan. First, a random forest model was used to determine the combination of input water quality parameters for each selected target variable. The ML models were then developed for each combination of input parameters and target variable. Model performance was assessed using three statistical indicators: R², mean squared error, and mean absolute error. The most suitable model was identified using compromise programming. The results show that the XGBoost and gradient boosting models outperform the other algorithms on these indicators, with near-perfect R² values for HCO3, CO3, and Mg with XGBoost and for electrical conductivity, SO4, temperature, and Ca with gradient boosting. CatBoost and LightGBM, by contrast, performed more robustly on some parameters, such as pH and dissolved solids, but weakly on the other water quality parameters.
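The abstract names compromise programming as the model-selection step but does not state which formulation was used. The sketch below is one common form, offered only as an illustration: each model's distance to the ideal point is computed with an L_p metric over the three indicators (R², MSE, MAE), and the model with the smallest distance ranks first. The function name `compromise_rank`, the equal weights, the choice p = 2, and all numbers in `demo_scores` are assumptions made for this example; none of them are taken from the study.

```python
from typing import Dict, List, Optional, Tuple


def compromise_rank(
    scores: Dict[str, List[float]],
    higher_is_better: List[bool],
    weights: Optional[List[float]] = None,
    p: float = 2.0,
) -> List[Tuple[str, float]]:
    """Rank models by L_p distance to the ideal point (smaller is better).

    scores maps a model name to its indicator values, in a fixed order.
    higher_is_better flags, per indicator, whether larger values are preferred.
    """
    names = list(scores)
    n_crit = len(higher_is_better)
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0 / n_crit] * n_crit

    distances = {name: 0.0 for name in names}
    for j in range(n_crit):
        column = [scores[name][j] for name in names]
        best = max(column) if higher_is_better[j] else min(column)
        worst = min(column) if higher_is_better[j] else max(column)
        span = abs(best - worst)
        for name in names:
            # Normalised gap to the ideal value, in [0, 1].
            gap = 0.0 if span == 0 else abs(best - scores[name][j]) / span
            distances[name] += (weights[j] * gap) ** p

    ranked = [(name, total ** (1.0 / p)) for name, total in distances.items()]
    return sorted(ranked, key=lambda item: item[1])


# Placeholder values in the order [R2, MSE, MAE]; NOT results from the study.
demo_scores = {
    "model_a": [0.98, 0.02, 0.10],
    "model_b": [0.90, 0.08, 0.20],
    "model_c": [0.95, 0.05, 0.12],
}
ranking = compromise_rank(demo_scores, higher_is_better=[True, False, False])
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

With these placeholder values, `model_a` is best on every indicator and so sits at the ideal point, while `model_b` is worst on every indicator and ranks last. Different weights or a different p can change the ordering of models that trade off one indicator against another.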