Retrieving heavy metal concentrations in naturally contaminated arable soils from hyperspectral reflectance has become increasingly important in recent years. At present, both conventional and deep learning-based inversion methods require a separate model for each metal, and they do not exploit long-range dependencies to capture the heterogeneity specific to each metal. As a result, accurate inversion of individual heavy metal concentrations remains difficult. To address this, the present study proposes the Spectrum Contextual Self-Attention Deep Learning Network (SCSANet), which uses a self-attention network to capture long-range spectral context dependencies. The model also incorporates efficient and accurate spectral input techniques and outputs multiple metals simultaneously. Experiments conducted in the study area assess the accuracy of the proposed model in estimating lead (Pb), copper (Cu), cadmium (Cd), and mercury (Hg) concentrations. The results show that spectral pre-processing has no significant impact on metal inversion, and that improving the neighbourhood-spectra input technique increases inversion accuracy. SCSANet achieves the highest inversion accuracy for metals whose contents are of similar magnitude, and it outperforms the compared method in inversion accuracy.