Abstract

Data-driven machine learning models have been used to predict levels of hazardous substances in groundwater. However, class-imbalanced data can yield models with very low sensitivity despite high overall accuracy. To address this issue, four algorithms, namely weighted cross-entropy loss, random oversampling, random undersampling, and adaptive synthetic sampling (ADASYN), were tested for their effectiveness in improving model sensitivity. Tests on geogenic high-arsenic groundwater data from the Datong Basin, the Red River Delta of Vietnam, Bangladesh, Texas, and California showed that all four algorithms produced more accurate predictions, with an average increase in sensitivity of 53.8% compared to the raw models. ADASYN performed best of the four and increased model G-means (the geometric mean of sensitivity and specificity) by >40% on average. The ADASYN-optimized artificial neural network (ANN) models predicted a higher groundwater arsenic (As) exposure risk in Ghana than in Ethiopia.
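To make the accuracy-versus-sensitivity gap concrete, the following is a minimal sketch, not the authors' implementation. It computes sensitivity, specificity, and the G-mean on made-up labels, and includes a plain random oversampler. The function names, the 95/5 class split, and the predictions are all hypothetical and chosen only for illustration.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return float(sens), float(spec)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return float(np.sqrt(sens * spec))

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        parts.append(idx)
        if n < target:
            # Sample with replacement to fill the gap to the majority size.
            parts.append(rng.choice(idx, size=target - n, replace=True))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# Hypothetical data: 95 low-As samples (0) and 5 high-As samples (1).
y_true = np.array([0] * 95 + [1] * 5)
# A hypothetical model that finds only 1 of the 5 high-As samples.
y_pred = np.array([0] * 95 + [1] + [0] * 4)

accuracy = float(np.mean(y_true == y_pred))
sens, spec = sensitivity_specificity(y_true, y_pred)
gm = g_mean(y_true, y_pred)
print(f"accuracy={accuracy:.2f} sensitivity={sens:.2f} "
      f"specificity={spec:.2f} g_mean={gm:.3f}")

# Balance a toy feature matrix before training.
X = np.arange(100, dtype=float).reshape(-1, 1)
X_bal, y_bal = random_oversample(X, y_true)
print(np.bincount(y_bal))
```

In this toy case the accuracy is 0.96 while the sensitivity is only 0.20, which is why a metric such as the G-mean is more informative on imbalanced data. ADASYN itself is not implemented here; one existing implementation is `imblearn.over_sampling.ADASYN` in the imbalanced-learn package, though the abstract does not say which software the authors used.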
