Comparative Assessment of Individual and Ensemble Machine Learning Models for Efficient Analysis of River Water Quality

Abdulaziz Alqahtani, Muhammad Izhar Shah, Muhammad Faisal Javed, Ali Aldrees

doi:10.3390/su14031183

Abstract

The prediction accuracy of machine learning (ML) models may depend not only on the input parameters and training dataset, but also on whether an ensemble or an individual learning model is selected. The present study compares individual supervised ML models, namely gene expression programming (GEP) and artificial neural network (ANN), with an ensemble learning model, random forest (RF), for predicting river water salinity in terms of electrical conductivity (EC) and total dissolved solids (TDS) in the Upper Indus River basin, Pakistan. The proposed models were trained and tested on a dataset of seven input parameters chosen on the basis of significant correlation. The ensemble RF model was optimized by producing 20 sub-models and selecting the most accurate one. The goodness-of-fit of the models was assessed through well-known statistical indicators: the coefficient of determination (R²), mean absolute error (MAE), root mean squared error (RMSE), and Nash–Sutcliffe efficiency (NSE). The results demonstrated a strong association between inputs and modeling outputs, with R² values of 0.96, 0.98, and 0.92 for the GEP, RF, and ANN models, respectively. The comparative performance of the proposed methods showed the relative superiority of RF over GEP and ANN. Among the 20 RF sub-models, the most accurate models yielded R² values of 0.941 and 0.938, with 70 and 160 estimators, respectively. The ensemble RF model yielded the lowest RMSE values, 1.37 on training data and 3.1 on testing data. The sensitivity analysis demonstrated that HCO₃⁻ is the most influential variable, followed by Cl⁻ and SO₄²⁻, for both EC and TDS. Evaluating the models against external criteria confirmed that the results of all three techniques generalize. In conclusion, the outcome of the present research indicates that the RF model with selected key parameters could be prioritized for water quality assessment and management.
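
The four goodness-of-fit indicators named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the sample values are hypothetical, and R² is assumed here to be the squared Pearson correlation between observed and predicted values, a common convention in hydrology that the source does not confirm.

```python
import math


def mae(observed, predicted):
    """Mean absolute error: average of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def r_squared(observed, predicted):
    """R^2 taken as the squared Pearson correlation (an assumption, see lead-in)."""
    mean_obs = sum(observed) / len(observed)
    mean_pred = sum(predicted) / len(predicted)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov ** 2 / (var_obs * var_pred)


if __name__ == "__main__":
    # Made-up values for illustration only; not data from the study.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.5, 3.5, 4.5]  # constant bias of +0.5
    print("MAE :", mae(observed, predicted))
    print("RMSE:", rmse(observed, predicted))
    print("R2  :", r_squared(observed, predicted))
    print("NSE :", nse(observed, predicted))
```

With these made-up values, a constant offset of 0.5 gives MAE = RMSE = 0.5 and R² = 1, while NSE drops to 0.8. Under this definition R² measures only linear association, whereas NSE also penalizes bias, which is one reason to report both.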

Highlights

Rivers are one of the essential components of surface water, which is needed for industrial processes, agricultural production, and hydroelectricity generation
The gene expression programming (GEP) model developed for water quality forecasts was chosen after completing a set of iterations with basic function sets and the smallest head size
The random forest (RF) model showed excellent prediction capability compared with the other methods, highlighting the overall superiority of ensemble learning techniques
Two mathematical expressions were established for total dissolved solids (TDS) and electrical conductivity (EC) prediction, highlighting the uniqueness of the GEP method

Summary

Introduction

Rivers are one of the essential components of surface water, which is needed for industrial processes, agricultural production, and hydroelectricity generation. With economic development and the growing use of water resources, surface water becomes contaminated, which lowers water quality and poses serious threats to human health. Human activities are among the main causes of water pollution, producing sewage, industrial discharge, and urban wastewater [2,3,4,5]. The present study considers total dissolved solids (TDS) and electrical conductivity (EC) as water quality indicators. Both are well-accepted parameters for measuring water quality and for examining the salt content and organic matter in water [6,7].

Methods

Results

Conclusion