Abstract
Water quality is a complex and important measure of the health and sustainability of ecosystems, and it directly influences biodiversity, human health, and the world economy. Predicting water quality therefore plays a crucial role in making informed decisions for environmental management. This study addresses that need by proposing a machine learning methodology applied to the public “Water Quality” dataset. The methodology models the dataset for classification and achieves high values on evaluation metrics such as accuracy, sensitivity, and specificity. It rests on two approaches: (a) the SMOTE method for handling imbalanced data and (b) a careful application of classical machine learning models. The paper uses Random Forests, Decision Trees, XGBoost, and Support Vector Machines because they can handle large datasets, can be trained on skewed datasets, and provide high accuracy in water quality classification. A key contribution of this work is the use of custom sampling strategies within the SMOTE approach, which significantly improved the performance metrics and the handling of class imbalance. With the XGBoost model, the results show significant improvements in predictive performance and the highest reported metrics: accuracy (98.92% vs. 96.06%), sensitivity (98.3% vs. 71.26%), and F1 score (98.37% vs. 79.74%). These improvements underscore the effectiveness of the custom SMOTE sampling strategies in addressing class imbalance. The findings contribute to environmental management by enabling ecology specialists to develop more accurate strategies for monitoring, assessing, and managing drinking water quality, supporting better ecosystem and public health outcomes.
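To make the idea of a custom SMOTE sampling strategy concrete, the following is a minimal, standard-library sketch and not the authors' implementation. It synthesizes minority-class samples by interpolating between a sample and one of its nearest same-class neighbours until each class reaches a user-chosen target count. The function name `smote`, the toy data, and the target counts are illustrative assumptions; the abstract does not state the sampling ratios used.

```python
import math
import random
from collections import Counter


def smote(X, y, sampling_strategy, k=5, seed=0):
    """Oversample classes up to the target counts in sampling_strategy.

    X: list of feature vectors (lists of floats); y: list of class labels.
    sampling_strategy: dict mapping label -> desired total count after
    resampling (illustrative values, not taken from the paper).
    """
    rng = random.Random(seed)
    counts = Counter(y)
    X_out, y_out = [list(row) for row in X], list(y)
    for label, target in sampling_strategy.items():
        members = [row for row, lab in zip(X, y) if lab == label]
        n_new = target - counts[label]
        if n_new <= 0 or len(members) < 2:
            continue  # nothing to synthesize for this class
        for _ in range(n_new):
            base = rng.choice(members)
            # k nearest same-class neighbours of the base point, excluding itself
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data: class 1 is the minority (3 samples vs. 5).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1], [1.0, 0.8]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

# Custom strategy: raise class 1 to 5 samples, leave class 0 unchanged.
X_res, y_res = smote(X, y, sampling_strategy={1: 5}, k=2)
print(Counter(y_res))
```

In practice, a library such as imbalanced-learn offers a `SMOTE` class whose `sampling_strategy` parameter accepts a similar per-class dictionary; whether the study used that library is not stated in the abstract.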