Abstract

Input variable selection (IVS) is an integral part of building data-driven models for hydrological applications. Carefully chosen input variables enable data-driven models to discern relevant patterns and relationships within the data, improving their predictive accuracy. The optimal choice of input variables can also enhance the computational efficiency of data-driven models, reduce overfitting, and contribute to a more interpretable and parsimonious model, whereas including irrelevant and/or redundant input variables can introduce noise and hinder generalization. Three probabilistic IVS methods, namely Edgeworth approximation-based conditional mutual information (EA), double-layer extreme learning machine (DLELM), and gradient mapping (GM), were used for IVS and then coupled with a long short-term memory (LSTM)-based probabilistic deep learning model for daily streamflow prediction. While EA is an effective IVS method, DLELM and GM are probabilistic neural network-based IVS methods that have not yet been explored for hydrological prediction. DLELM selects input variables through sparse Bayesian learning, pruning both input- and output-layer weights of a committee of neural networks. GM is based on saliency mapping, an explainable AI technique commonly used in computer vision that can be coupled with probabilistic neural networks. Both DLELM and GM involve randomization during parameter initialization and/or training, thereby introducing stochasticity into the IVS procedure, which has been shown to improve the predictive performance of data-driven models. The IVS methods were coupled with an LSTM-based probabilistic deep learning model and applied to a streamflow prediction case study using 420 basins spread across the continental United States. The dataset includes 37 candidate input variables derived from daily-averaged ERA-5 reanalysis data. Comparing the input variables most frequently selected by EA, DLELM, and GM across the 420 basins revealed that the three methods select broadly similar sets of variables; for example, nine of the top 15 variables selected by each method were common to all three. The input variables selected by EA, DLELM, and GM were then used in LSTM-based probabilistic deep learning models for streamflow prediction across the 420 basins. The models were developed and optimized using the top 10 variables selected by each IVS method, and the results were compared to a benchmark scenario that used all 37 ERA-5 variables in the prediction model. Overall, the findings show that the GM method yields higher prediction accuracy (Kling-Gupta efficiency; KGE) than the other two IVS methods: the median KGE for GM was 0.63, compared with 0.61 for EA, 0.60 for DLELM, and 0.62 for the all-variables benchmark. DLELM and GM are two AI-based techniques that introduce elements of interpretability and stochasticity to the IVS process. The results of this study are expected to contribute to the evolving landscape of data-driven hydrological modeling by introducing hitherto unexplored neural network-based IVS methods in pursuit of more parsimonious, efficient, and interpretable probabilistic deep learning models.
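For illustration, the sketch below shows a minimal gradient-mapping (saliency-based) IVS step, assuming a small deterministic LSTM and synthetic data; it is not the authors' implementation. Candidate inputs are ranked by the mean absolute gradient of the prediction with respect to each input variable. All names, dimensions, and the toy data are assumptions made for the example; the study's GM method additionally works with probabilistic networks and repeats the procedure under random initializations.

```python
# Illustrative sketch of saliency-based input variable ranking for an LSTM.
# Assumptions: toy synthetic data, a point-prediction LSTM, and a single run;
# none of this reproduces the study's probabilistic GM setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

N_VARS, SEQ_LEN, N_SAMPLES = 37, 30, 256      # e.g. 37 candidate forcing variables


class StreamflowLSTM(nn.Module):
    def __init__(self, n_inputs, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # point prediction for simplicity

    def forward(self, x):                     # x: (batch, seq_len, n_inputs)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :]).squeeze(-1)


# Toy inputs and target; in practice these would be ERA-5 forcings and streamflow.
X = torch.randn(N_SAMPLES, SEQ_LEN, N_VARS)
y = X[:, -5:, 0].mean(dim=1) + 0.5 * X[:, -1, 3] + 0.1 * torch.randn(N_SAMPLES)

model = StreamflowLSTM(N_VARS)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):                          # brief full-batch training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

# Saliency: gradient of the summed predictions with respect to the inputs,
# averaged over samples and time steps to give one score per candidate variable.
X.requires_grad_(True)
model(X).sum().backward()
saliency = X.grad.abs().mean(dim=(0, 1))      # shape: (N_VARS,)
top10 = torch.argsort(saliency, descending=True)[:10]
print("Top-10 candidate input variables by saliency:", top10.tolist())
```

In the study, such saliency-based rankings would presumably be aggregated across random initializations and training runs (the source of the stochasticity noted above) before the top-10 variables are passed to the prediction model.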
