Contaminated sediments can adversely affect aquatic ecosystems, making the identification and management of pollutant sources extremely important. In this study, we proposed machine learning approaches to identify the sources of heavy metal contamination in downstream sediment and their influential distances. We employed classification models based on artificial neural networks (ANN) and random forests (RF) to predict the heavy metal contamination of stream sediments using upland environmental variables as input features. A comprehensive Korean nationwide monitoring database containing 1546 datasets was used to train and test the models. These datasets encompass the concentrations of eight heavy metals (As, Cd, Cr, Cu, Hg, Ni, Pb, and Zn) in sediment samples collected from 160 stream sites across the nation from 2014 to 2018. The models' prediction accuracy was evaluated for input feature sets drawn from different influential upland areas, defined for each site by different buffer radii and by the watershed boundary. Although both the ANN and RF models were unsatisfactory in predicting heavy metal quartile classes, RF classifiers with adaptive synthetic oversampling (ORFC) predicted the classes of the sediment samples based on Canada's Sediment Quality Guidelines reasonably well (accuracy ranged from 0.67 to 0.94). The best influential distance (i.e., buffer radius) was determined for each heavy metal based on the accuracy of the ORFC. The results indicated that Cd, Cu, and Pb had shorter influential distances (1.5–2.0 km) than the other heavy metals, with little difference in accuracy for different influential distances. Feature importance calculations revealed that upland soil contamination was the primary factor for Hg and Ni, while residential areas and roads were significant features associated with Pb and Zn contamination.
This approach identifies the major contamination sources and their influential areas, which can then be prioritized when managing contaminated stream sediments.
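The selection of a best influential distance per heavy metal can be illustrated with a minimal sketch. It assumes only that a classification accuracy is available for each candidate buffer radius. The function name `best_influential_distance`, the tie-breaking rule (prefer the smaller radius), and all numbers in `example_accuracy` are hypothetical placeholders, not values or procedures reported in the study.

```python
from typing import Dict, Tuple


def best_influential_distance(acc_by_radius: Dict[float, float]) -> Tuple[float, float]:
    """Return (radius_km, accuracy) for the radius with the highest accuracy.

    Assumption: ties are broken in favor of the smaller radius.
    """
    if not acc_by_radius:
        raise ValueError("no accuracy values supplied")
    # Sort key: highest accuracy first, then smallest radius.
    radius = min(acc_by_radius, key=lambda r: (-acc_by_radius[r], r))
    return radius, acc_by_radius[radius]


# Hypothetical accuracies per metal and buffer radius (km); illustrative only.
example_accuracy = {
    "Cd": {1.0: 0.80, 1.5: 0.86, 2.0: 0.84, 5.0: 0.79},
    "Zn": {1.0: 0.70, 1.5: 0.72, 2.0: 0.72, 5.0: 0.75},
}

best = {
    metal: best_influential_distance(acc)
    for metal, acc in example_accuracy.items()
}

for metal, (radius, acc) in best.items():
    print(f"{metal}: best radius {radius} km (accuracy {acc:.2f})")
```

In practice the accuracies would come from evaluating a trained classifier on the feature set built for each buffer radius. When accuracies differ little across radii, as the abstract notes for some metals, the tie-breaking rule has a strong effect on the selected distance.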