Accurately predicting drinking water quality is critical for intelligent water supply management and for maintaining the stability and efficiency of water treatment processes. This study presents an optimized time series machine learning approach for accurately predicting multivariate drinking water quality, explicitly considering the time-dependent effects of reagent dosing. By leveraging data from a full-scale treatment plant, we constructed feature-engineered time series datasets incorporating influent water quality parameters, reagent dosages, and effluent water characteristics. Seven predictive models, including both traditional machine learning (ML) and deep learning (DL) models, were developed and rigorously evaluated against a naive mean baseline model. Our results demonstrate that traditional ML models, enhanced with time feature engineering, rivaled the performance of widely used DL models and surpassed the naive mean baseline model. Specifically, an XGBoost model achieved superior prediction accuracy in dynamically forecasting four water quality characteristics at a 12-hour time lag step, outperforming the naive baseline model by 3-4% in terms of Mean Absolute Percentage Error (MAPE). This finding underscores the importance of incorporating a 12-hour interval to effectively capture the delayed impact of reagent dosing on water quality prediction. Furthermore, SHAP model interpretability analysis provided valuable insights into the XGBoost model's decision-making process, revealing its strong data-driven foundation aligned with established water treatment principles. This research highlights the significant potential of optimized machine learning techniques for enhancing water purification processes and enabling more informed, data-driven decision-making in the water supply industry.
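The time feature engineering and baseline comparison described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`make_lag_features`, `mape`) and the toy readings are hypothetical, and a real pipeline would feed the lagged features into XGBoost rather than stopping at the naive mean baseline.

```python
# Illustrative sketch (not the study's code): build 12-hour lag features
# and score a naive mean baseline with MAPE. All names and values here
# are hypothetical stand-ins for the paper's full-scale plant data.

def make_lag_features(series, lag):
    """Pair each target value with the observation `lag` steps earlier,
    mimicking the delayed effect of reagent dosing on effluent quality."""
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y_true, y_pred)) / len(y_true)

# Toy hourly water-quality readings over two days (synthetic values).
readings = [1.0 + 0.1 * (t % 24) for t in range(48)]

pairs = make_lag_features(readings, lag=12)  # 12-hour dosing delay
y = [target for _, target in pairs]

# Naive mean baseline: always predict the mean of the targets.
mean_pred = sum(y) / len(y)
baseline_mape = mape(y, [mean_pred] * len(y))
```

A trained model (e.g. XGBoost on the lagged features) would then be judged by how far its MAPE falls below `baseline_mape`, which is how the 3-4% improvement reported above is framed.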