This study uses machine learning (ML) algorithms to develop a robust total organic carbon (TOC) prediction model for river waters in the Geumho River sub-basins, South Korea, considering both non-rain and rain events. The model incorporates geospatial parameters such as land use, slope, flow rate, and basic water quality metrics including biochemical oxygen demand (BOD), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), and suspended solids (SS). A key aspect of this research is examining how land use information affects the model's predictive accuracy. We compared two ML algorithms, extreme gradient boosting (XGBoost) and deep neural networks (DNN), with a traditional multiple linear regression (MLR) approach. XGBoost outperformed the others, achieving an R² between 0.61 and 0.68 on the test dataset; for rain events, including land use data raised its R² to 0.77. No such improvement was observed with the MLR model. Feature importance analysis using Shapley values identified COD as the primary predictor for non-rain events, while during rain events COD, TP, TN, SS, and agricultural land collectively influenced TOC levels. These results clarify how TOC varies across land use scenarios in river systems and underscore the importance of integrating geospatial and water quality parameters to improve TOC prediction, particularly during rain events. The methodology provides a framework for developing river management strategies and monitoring long-term TOC trends, especially where essential monitoring data have gaps.
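The model comparison above rests on the coefficient of determination, R². The sketch below shows how such a score could be computed for a set of TOC predictions, assuming the standard definition R² = 1 - SS_res / SS_tot; the abstract does not state which formula or software was used, and the function name `r_squared` and the sample values are hypothetical, not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error between observations and predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Illustrative values only (hypothetical TOC concentrations, not study data).
toc_observed = [2.0, 3.0, 4.0, 5.0]
toc_predicted = [2.5, 2.5, 4.5, 4.5]
score = r_squared(toc_observed, toc_predicted)
print(f"R^2 = {score:.2f}")
```

Scoring the same test set once with predictions from a model trained without land use features and once with them would give the kind of with/without comparison the abstract reports.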