Chlorophyll-a (Chl-a) concentrations, a key indicator of algal blooms, were estimated using the XGBoost machine learning model with 23 variables, including water quality and meteorological factors. The model performance was evaluated using three indices: root mean square error (RMSE), RMSE-observation standard deviation ratio (RSR), and Nash-Sutcliffe efficiency. Nine datasets were created by averaging 1hour data to cover time frequencies ranging from 1hour to 1month. The dataset with relatively high observation frequencies (1-24 h) maintained stability, with an RSR ranging between 0.61 and 0.65. However, the model's performance declined significantly for datasets with weekly and monthly intervals. The Shapley value (SHAP) analysis, an explainable artificial intelligence method, was further applied to provide a quantitative understanding of how environmental factors in the watershed impact the model's performance and is also utilized to enhance the practical applicability of the model in the field. The number of input variables for model construction increased sequentially from 1 to 23, starting from the variable with the highest SHAP value to that with the lowest. The model's performance plateaued after considering five or more variables, demonstrating that stable performance could be achieved using only a small number of variables, including relatively easily measured data collected by real-time sensors, such as pH, dissolved oxygen, and turbidity. This result highlights the practicality of employing machine learning models and real-time sensor-based measurements for effective on-site water quality management. PRACTITIONER POINTS: XAI quantifies the effects of environmental factors on algal bloom prediction models The effects of input variable frequency and seasonality were analyzed using XAI XAI analysis on key variables ensures cost-effective model development.
Read full abstract