A stacking ANN ensemble model of ML models for stream water quality prediction of Godavari River Basin, India

Nagalapalli Satish, Jagadeesh Anmala, K. Rajitha, Murari R.R. Varma

doi:10.1016/j.ecoinf.2024.102500

Abstract

Water quality models have grown in importance because they provide critical input to the development of risk assessment frameworks for the environmental management and monitoring of rivers. Recent advances in machine learning (ML) algorithms now make better predictions possible. This study proposes a cause-and-effect model that considers climatological factors, such as temperature and precipitation, along with geospatial information on land use: the agricultural land use factor (ALUF), the forest land use factor (FLUF), the grassland use factor (GLUF), the shrub land use factor (SLUF), and the urban land use factor (ULUF). These factors form the input data, whereas four Stream Water Quality Parameters (SWQPs), namely Electrical Conductivity (EC), Biochemical Oxygen Demand (BOD), Nitrate, and Dissolved Oxygen (DO), measured from 2019 to 2021, are taken as outputs to predict the water quality of the Godavari River Basin. In the preliminary investigation, of these four SWQPs, nitrate has a high coefficient of variation (CV), revealing a close association with climate parameters and land use practices across the sampling stations. In the authors' earlier study, a model using a single-layer Feed-Forward Neural Network (FFNN) showed improved performance in predicting the cause-and-effect factors linked to water quality metrics. To achieve better prediction, this study compares a stacked Artificial Neural Network (ANN) meta-model with nine conventional ML models: Extreme Gradient Boosting (XGB), Extra Trees (ET), Bagging (BG), Random Forest (RF), AdaBoost or Adaptive Boosting (ADB), Decision Tree (DT), Highest Gradient Boosting (HGB), Light Gradient Boosting Method (LGBM), and Gradient Boosting (GB). According to the findings, the bagging and boosting models outperformed the earlier stand-alone FFNN on the same dataset and forecast the variables of interest more accurately. For instance, during testing, the coefficient of determination (R²) for BOD increased from 0.72 to 0.87. Furthermore, a stacked ANN meta-model that uses XGB, RF, and ET as base models performed better than the individual ML models (R² for BOD in testing increased from 0.87 to 0.91). This new framework can also minimize the effort required for hyperparameter tuning.
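
As a rough illustration of the stacking idea summarized above, the following Python sketch trains tree-based base models and feeds their cross-validated predictions to a small neural network meta-model. It is not the authors' implementation. It assumes scikit-learn and NumPy are available, uses synthetic data in place of the Godavari measurements, substitutes scikit-learn's `GradientBoostingRegressor` for XGBoost so that no extra package is needed, and uses placeholder hyperparameters throughout.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): seven inputs mimicking temperature,
# precipitation and five land use factors; one target standing in for an SWQP.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 7))
y = (
    2.0 * X[:, 0]
    + np.sin(3.0 * X[:, 1])
    + X[:, 2] * X[:, 3]
    + rng.normal(0.0, 0.1, size=300)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base models. GradientBoostingRegressor is a stand-in for XGBoost (assumption).
base_models = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
]

# Meta-model: a small ANN trained on the scaled base-model predictions.
meta_model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                 max_iter=2000, random_state=0),
)

# cv=5: the meta-model is fitted on out-of-fold predictions of the base models.
stack = StackingRegressor(estimators=base_models,
                          final_estimator=meta_model, cv=5)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
r2_stacked = r2_score(y_test, y_pred)
print(f"Stacked model R^2 on held-out data: {r2_stacked:.3f}")

# Fit each base model alone for comparison on the same split.
for name, model in base_models:
    model.fit(X_train, y_train)
    print(f"{name} R^2: {r2_score(y_test, model.predict(X_test)):.3f}")
```

The scores printed here come from synthetic data and say nothing about the values reported in the abstract. The sketch only shows how a meta-model can be fitted on out-of-fold base-model predictions, which is what lets the ensemble combine the base learners without extensive tuning of each one.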

Full Text