Similarity algorithms are commonly used in soil forensic applications to identify samples in an existing reference library that resemble an unknown target sample and therefore indicate its possible source locations. These algorithms are well suited to comparing soil spectra. However, different similarity algorithms may lead to different clusters of similar samples, and thus to different strengths of evidence in forensic investigations. To quantify this, we evaluated the influence of seven similarity algorithms on soil provenance, using a soil spectral library of 280 soil profiles from Anhui Province, China, as the sample set. This library includes datasets at three spatial scales: provincial (DSp), county (DSc) and field (DSf). Ten samples covering a wide range of spectral variation were selected from the DSf dataset as the “unknown” samples, with the remainder used as reference samples. This study aimed to: (1) evaluate how several commonly used similarity algorithms, namely Euclidean distance (ED), Mahalanobis distance (MD), Spectral angle mapper (SAM) and Spectral information divergence (SID), as well as variants of three of these computed in the standardized principal component (PC) space of the spectra (ED_PCA, MD_PCA and SAM_PCA), influence the identification of the most similar matched samples; (2) determine the overlap in sample selection between different similarity algorithms; and (3) propose best practices for similarity algorithms applied to soil forensic analysis using spectroscopy. The choice of similarity algorithm did influence which samples were selected as most similar. The similarity algorithms calculated in PC space (ED_PCA, MD_PCA and SAM_PCA) performed slightly better than their counterparts calculated in spectral space.
Because a detailed spectral library was available, the most similar matched samples were located close to the unknowns regardless of the similarity algorithm used, mostly within 3 km, with one exception. That is, the choice of similarity algorithm hardly influenced the conclusion on soil provenance in this case. Overall, MD_PCA, SAM and ED were the best-performing similarity algorithms. However, since there was no single best algorithm for all cases, we recommend the joint use of MD_PCA, SAM and ED as an ensemble. Indications of possible sample provenance from these similarity measures can complement evidence from other methods in a forensic investigation.
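
As an illustration of why the choice of algorithm can change which reference samples are matched, the following minimal sketch implements three of the spectral-space measures named above (ED, SAM and SID) using their standard textbook definitions and ranks a small reference library against one unknown. It is not the implementation used in this study: the function names, the `rank_references` helper and the four-band toy spectra are hypothetical, and SID is assumed to receive strictly positive values such as reflectance.

```python
import math


def euclidean_distance(x, y):
    """ED: straight-line distance between two spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def spectral_angle(x, y):
    """SAM: angle in radians between two spectra treated as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)


def spectral_information_divergence(x, y):
    """SID: symmetric relative entropy between normalized spectra.

    Assumes every band value is strictly positive.
    """
    sum_x, sum_y = sum(x), sum(y)
    p = [a / sum_x for a in x]
    q = [b / sum_y for b in y]
    return sum(
        pi * math.log(pi / qi) + qi * math.log(qi / pi)
        for pi, qi in zip(p, q)
    )


def rank_references(unknown, library, measure):
    """Return reference names ordered from most to least similar."""
    return sorted(library, key=lambda name: measure(unknown, library[name]))


# Hypothetical toy data: ref_b has the same shape as ref_a at twice the
# magnitude, and ref_c has the reversed shape.
library = {
    "ref_a": [0.10, 0.20, 0.30, 0.40],
    "ref_b": [0.20, 0.40, 0.60, 0.80],
    "ref_c": [0.40, 0.30, 0.20, 0.10],
}
unknown = [0.11, 0.21, 0.29, 0.41]

if __name__ == "__main__":
    for measure in (
        euclidean_distance,
        spectral_angle,
        spectral_information_divergence,
    ):
        print(measure.__name__, rank_references(unknown, library, measure))
```

In this toy case ED places ref_b far from the unknown because of its larger magnitude, whereas SAM and SID depend only on spectral shape and treat ref_a and ref_b as essentially equally close. Differences of this kind are one reason an ensemble of measures may be preferable to any single one.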