ABSTRACT Estimating water quality parameters is important for improving time and cost-effectiveness relative to the conventional approach. This study aims to identify the best machine learning (ML) approach for predicting concentrations of biochemical oxygen demand (BOD), nitrate, and phosphate. Four ML techniques (decision tree, random forest, gradient boosting, and XGBoost) were compared for estimating these water quality parameters from biophysical inputs (population, basin area, river slope, water level, and stream flow) and physicochemical inputs (conductivity, turbidity, pH, temperature, and dissolved oxygen). The data were split into training and test sets, and model performance was evaluated using the coefficient of determination (R2), the Nash–Sutcliffe efficiency coefficient (NSE), and the root mean squared error (RMSE). The mean squared error (MSE) was used as the optimization target. With fivefold cross-validation and hyperparameter tuning, R2 values of 0.76, 0.67, and 0.71 were achieved for phosphate, nitrate, and BOD, with NSE values of 0.73, 0.67, and 0.66, respectively. XGBoost yielded the lowest RMSE for all three parameters and performed best when all metrics were considered. In conclusion, ML techniques, particularly when combined with robust cross-validation and hyperparameter optimization, showed good results in predicting water quality parameters.
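The three evaluation metrics named above can be illustrated with a minimal sketch. It makes two assumptions that the abstract does not confirm: R2 is taken as the squared Pearson correlation between observed and predicted values (a common convention in hydrology, and one that lets R2 differ from NSE as in the reported results), and the sample values are hypothetical, not data from the study. The function names are illustrative only.

```python
import math

def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - (sum of squared errors / variance of observations)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def r2_pearson(obs, pred):
    """Squared Pearson correlation (ASSUMED definition of R2; not stated in the source)."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov ** 2 / (var_obs * var_pred)

# Hypothetical concentrations (e.g. mg/L), for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print("RMSE:", rmse(observed, predicted))
print("NSE: ", nse(observed, predicted))
print("R2:  ", r2_pearson(observed, predicted))
```

Under this definition, R2 measures only how well predictions co-vary with observations, while NSE also penalizes bias and scale errors, so NSE is never larger than R2 for the same predictions.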