Determining the occurrence of disinfection byproducts (DBPs) in drinking water distribution systems (DWDSs) remains challenging. Predicting DBPs from readily available water quality parameters can help to assess DBP-associated risks and to capture the complex interrelationships between water quality and DBP occurrence. In this study, we collected drinking water samples from a distribution network over the course of a year and measured haloacetic acids (HAAs) together with the related water quality parameters (WQPs). Twelve machine learning (ML) algorithms were evaluated. Random Forest (RF) achieved the best performance for predicting HAA concentration (R² of 0.78 and RMSE of 7.74). Instead of using cytotoxicity or genotoxicity separately as the surrogate for the toxicity associated with HAAs, we created a health risk index (HRI), calculated as the sum of the cytotoxicity and genotoxicity of HAAs following the widely used Tic-Tox approach. ML models were likewise developed to predict the HRI, and the RF model again performed best, with an R² of 0.69 and an RMSE of 0.38. To explore more advanced ML approaches, we developed three models using uncertainty-based active learning. The Categorical Boosting Regression (CAT) model developed through active learning substantially outperformed the other models, achieving R² values of 0.87 and 0.82 for predicting HAA concentration and the HRI, respectively. Feature importance analysis with the CAT model revealed that temperature, ions (e.g., chloride and nitrate), and dissolved organic carbon (DOC) concentration in the distribution network had a significant impact on the occurrence of HAAs, whereas chloride ion, pH, oxidation-reduction potential (ORP), and free chlorine were the most important features for HRI prediction. This study demonstrates that ML has potential for predicting HAA occurrence and toxicity. By identifying the key WQPs that affect HAA occurrence and toxicity, this research offers valuable insights for targeted DBP mitigation strategies.
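
The abstract names uncertainty-based active learning without describing the procedure, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a single input feature and swaps in a bootstrap committee of simple least-squares lines in place of the RF or CAT regressors used in the study. The spread of the committee's predictions serves as the uncertainty estimate, and the candidate sample with the largest spread is labeled next. All names (`fit_line`, `bootstrap_committee`, `predictive_std`, `active_learning`) and the `oracle` callable, which stands in for a laboratory measurement, are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate sample (all x identical): fall back to a constant model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bootstrap_committee(xs, ys, n_models, rng):
    """Fit n_models lines, each on a bootstrap resample of the labeled data."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predictive_std(models, x):
    """Uncertainty proxy: standard deviation of the committee's predictions at x."""
    return statistics.pstdev(a + b * x for a, b in models)


def active_learning(pool_x, oracle, n_initial, n_queries, n_models=25, seed=0):
    """Label n_initial random samples, then repeatedly label the most uncertain one.

    Returns (labeled_x, labeled_y, remaining_unlabeled_x).
    """
    rng = random.Random(seed)
    pool = list(pool_x)
    rng.shuffle(pool)
    labeled_x = pool[:n_initial]
    labeled_y = [oracle(x) for x in labeled_x]
    unlabeled = pool[n_initial:]
    for _ in range(n_queries):
        if not unlabeled:
            break
        models = bootstrap_committee(labeled_x, labeled_y, n_models, rng)
        # Query the candidate on which the committee disagrees the most.
        query = max(unlabeled, key=lambda x: predictive_std(models, x))
        unlabeled.remove(query)
        labeled_x.append(query)
        labeled_y.append(oracle(query))
    return labeled_x, labeled_y, unlabeled
```

A call such as `active_learning(pool, measure, n_initial=5, n_queries=10)`, where `pool` is a list of candidate feature values and `measure` is a user-supplied labeling function, returns the labeled samples and the remaining candidates. In practice the committee would be replaced by a multivariate regressor that exposes a predictive uncertainty, for example the per-tree spread of a random forest.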