Excessive concentrations of nickel (Ni) in soil harm human health, cause disease, and threaten animals and plants. Although the dangers of high Ni concentrations are widely recognized, rapid, large-scale tools for identifying Ni contamination are still lacking. Visible-near-infrared (Vis-NIR) spectroscopy has been used to rapidly identify Ni contamination; however, previous studies relied on small datasets and tended to neglect data imbalance. To address these issues, a large dataset comprising 18,675 soil samples was used to predict soil Ni contamination by combining Vis-NIR data with machine learning (ML). The data imbalance was addressed using two data sampling methods. To build a robust classification model for Ni contamination, four spectral preprocessing methods and four ML algorithms were compared. The optimal extreme gradient boosting model achieved recall, accuracy, area under the curve, and geometric mean scores of 0.8203, 0.8806, 0.9268, and 0.8508, respectively. Model predictions across the United States identified specific regions with a high probability of Ni contamination. Overall, the model developed in this study offers improved accuracy in predicting soil Ni contamination at the continental scale and can be used to prioritize further testing and guide policymaking.
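The abstract reports recall and geometric mean alongside accuracy. On an imbalanced dataset these metrics can diverge sharply, which is why accuracy alone is not a sufficient measure. The following is a minimal sketch, not the authors' code, of how such scores are commonly computed from binary predictions. It assumes the geometric mean is the usual square root of recall times specificity (the abstract does not state the definition used), and the function names and toy labels are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels, where 1 = contaminated."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def imbalance_metrics(y_true, y_pred):
    """Compute recall, specificity, accuracy, and G-mean.

    Assumption: G-mean = sqrt(recall * specificity), the common
    definition for binary imbalanced classification.
    """
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = math.sqrt(recall * specificity)
    return {
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "g_mean": g_mean,
    }


# Toy, made-up labels (not data from the study): 2 contaminated, 8 clean.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = imbalance_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is 0.8 even though only half of the contaminated samples are detected. The lower recall and geometric mean expose the gap that accuracy hides.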