Addressing missing data in water management is crucial to ensuring the reliability of the scientific research that informs public health decision-making. This study aims to identify the root causes of cyanobacteria proliferation under major missing-data scenarios. To this end, a dynamic missing-data management methodology based on Bayesian machine learning is proposed for accurate surface water quality prediction in a river of the Limia basin (Spain). The methodology comprises a sequence of analytical steps: data pre-processing, selection of a reliable dynamic Bayesian missing-value prediction system, and a supervised analysis of the behavioral patterns exhibited by cyanobacteria. A total of 2,118,844 data points were used, of which 205,316 (9.69 %) were identified as missing. Testing showed iterative structural expectation maximization (SEM) to be the best-performing algorithm, ahead of dynamic imputation (DI) and entropy-based dynamic imputation (EBDI), in some cases improving imputation accuracy by approximately 50 % in terms of R², RMSE, NRMSE, and logarithmic loss. These findings can influence how water quality data are processed and studied, opening the door to more reliable water management strategies that better inform public health decisions.
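As an illustration of the accuracy measures named above, the following minimal sketch scores imputed values against held-out observations. It is not the study's implementation: the function name `imputation_scores` is hypothetical, and NRMSE is assumed here to be RMSE divided by the observed range (other conventions normalize by the mean or standard deviation). The last line only reproduces the missing-data share from the counts reported in the abstract.

```python
import numpy as np


def imputation_scores(observed, imputed):
    """Score imputed values against held-out observed values (illustrative helper)."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    residual = observed - imputed

    # Root mean squared error of the imputations.
    rmse = float(np.sqrt(np.mean(residual ** 2)))

    # Assumption: NRMSE normalizes RMSE by the range of the observed values.
    value_range = float(observed.max() - observed.min())
    nrmse = rmse / value_range if value_range > 0 else float("nan")

    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((observed - observed.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"rmse": rmse, "nrmse": nrmse, "r2": r2}


# Share of missing values, from the counts reported in the abstract.
missing_share = 205_316 / 2_118_844 * 100
```

In practice such scores would be computed on values that were deliberately masked before imputation, so that the true observations are available for comparison.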