Abstract

The Random Forest (RF) algorithm, a decision-tree-based technique, has become a promising approach for runoff forecasting in remote areas. This machine learning approach can overcome the limitations of scarce spatio-temporal data and of the physical parameters needed by process-based hydrological models. However, the influence of the RF hyperparameters is still uncertain and needs to be explored. The aim of this study is therefore to analyze the sensitivity of RF runoff forecasting models of varying lead time to the hyperparameters of the algorithm. Models were trained using (a) default and (b) extensive hyperparameter combinations through a grid-search approach that allows the optimal set to be reached. Model performance was assessed with the R2, %Bias, and RMSE metrics. We found that: (i) the most influential hyperparameter is the number of trees in the forest; however, the combination of the tree-depth and number-of-features hyperparameters produced the highest variability (instability) in the models. (ii) Hyperparameter optimization significantly improved model performance for longer lead times (12 and 24 h). For instance, the performance of the 12-h forecasting model under default RF hyperparameters improved to R2 = 0.41 after optimization (a gain of 0.17). For the short lead time (4 h), however, there was no significant improvement (0.69 < R2 < 0.70). (iii) For each hyperparameter there is a range of values within which model performance is not significantly affected and remains close to optimal; thus, a compromise between hyperparameter interactions (i.e., their values) can produce similarly high model performance. The improvements after optimization can be explained from a hydrological point of view: for lead times longer than the concentration time of the catchment, the generalization ability of the models tends to rely more on hyperparameterization than on what they can learn from the input data.
This insight can help in the development of operational early warning systems.
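The grid-search procedure described in the abstract can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' exact setup: the hyperparameter grid values, the synthetic stand-in data, and the cross-validation settings are assumptions; the paper tunes the number of trees, tree depth, and number of features and reports R2 (among other metrics), which is what the sketch mirrors.

```python
# Hedged sketch of a grid search over the RF hyperparameters discussed in
# the abstract: number of trees (n_estimators), tree depth (max_depth),
# and number of features per split (max_features). Grid values and the
# toy data are illustrative only, not taken from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Toy stand-in for lagged hydrometeorological features and a runoff target.
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

param_grid = {
    "n_estimators": [50, 100, 300],   # number of trees in the forest
    "max_depth": [None, 5, 10],       # depth of each tree
    "max_features": [1.0, "sqrt"],    # features considered at each split
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",  # the study also reports %Bias and RMSE
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

In practice, one such search would be run per lead-time model (4, 12, and 24 h), and the best cross-validated configuration compared against the default-hyperparameter baseline.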

Highlights

  • Among the machine learning techniques most widely used in different fields of science, the Random Forest (RF) [1] is one of the most useful and best performing for both classification and regression applications [2,3,4,5,6,7,8,9,10]

  • The analysis described was performed for the models built for different lead times: 4, 12, and 24 h

  • This study evaluated the impact of the most relevant Random Forest (RF) hyperparameters on the performance of short-term runoff forecasting models in mountainous regions



Introduction

Among the machine learning techniques most widely used in different fields of science, the Random Forest (RF) [1] is one of the most useful and best performing for both classification and regression applications [2,3,4,5,6,7,8,9,10]. For time-series data, however, there are few RF applications in the current literature [11]. The robustness of RF is explained by the capability of the algorithm to deal with datasets exhibiting specific problems, such as missing values and/or outliers, non-standardized data, and unbalanced data in relatively high-dimensional spaces [7]. The RF algorithm allows complex interactions between input features, which results in relatively good handling of model overfitting [3].


