Stream nutrient concentrations exhibit marked temporal variation due to hydrology and other factors such as the seasonality of biological processes. Many water quality monitoring programs sample too infrequently (e.g., weekly or monthly) to fully characterize lotic nutrient conditions and to accurately estimate nutrient loadings. A popular solution to this problem is the surrogate-regression approach, a method by which nutrient concentrations are estimated from related parameters (e.g., conductivity or turbidity) that can easily be measured in situ at high frequency using sensors. However, stream water quality data often exhibit skewed distributions, nonlinear relationships, and multicollinearity, all of which can be problematic for linear-regression models. Here, we use a flexible and robust machine learning technique, Random Forests Regression (RFR), to estimate stream nitrogen (N) and phosphorus (P) concentrations from sensor data within a forested, mountainous drainage area in upstate New York. When compared to actual nutrient data from samples tested in the laboratory, this approach explained much of the variation in nitrate (89%), total N (85%), particulate P (76%), and total P (74%). The models were less accurate for total soluble P (47%) and soluble reactive P (32%), though concentrations of these latter parameters were in a relatively low range. Although soil moisture and fluorescent dissolved organic matter are not commonly used as surrogates in nutrient-regression models, they were important predictors in this study. We conclude that RFR shows great promise as a tool for modeling instantaneous stream nutrient concentrations from high-frequency sensor data, and encourage others to evaluate this approach for supplementing traditional (laboratory-determined) nutrient datasets.
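As a minimal sketch of the surrogate-regression workflow described above (not the authors' actual code or configuration), the example below fits a random forest regressor to hypothetical high-frequency sensor surrogates, including fDOM and soil moisture, to predict a nutrient concentration. The predictor names, synthetic data, and hyperparameters are assumptions for illustration only.

```python
# Illustrative random forest regression of a nutrient concentration on
# sensor surrogates using scikit-learn. All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 500  # hypothetical number of time points with paired lab samples

# Synthetic in-situ sensor surrogates (columns are assumed, not from the study)
X = np.column_stack([
    rng.lognormal(1.0, 0.8, n),   # turbidity (NTU)
    rng.normal(50, 10, n),        # specific conductivity (uS/cm)
    rng.lognormal(2.0, 0.5, n),   # fluorescent dissolved organic matter (fDOM)
    rng.uniform(0.1, 0.4, n),     # soil moisture (volumetric fraction)
])
# Synthetic "laboratory" nitrate concentrations with a nonlinear dependence
y = 0.02 * X[:, 0] ** 0.7 + 0.01 * X[:, 1] + rng.normal(0, 0.05, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 500 trees is an illustrative choice, not the study's tuned setting
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

print("R^2 on held-out samples:", r2_score(y_test, rf.predict(X_test)))
print("Feature importances:", rf.feature_importances_)
```

In a sketch like this, the held-out R^2 plays the role of the variance-explained values reported in the abstract, and the feature importances indicate which surrogates (e.g., fDOM or soil moisture) contribute most to the predictions.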