To maintain profitability in sugarcane areas of Australia, nutrients need to be applied to the soil to replace those lost to biomass production. For example, nitrogen fertiliser application requires consideration of soil organic carbon (SOC, %). However, determining SOC is time-consuming. An alternative is to use a visible–near infrared (Vis–NIR) spectroscopy library. Herein, a Vis–NIR library was developed to predict topsoil (0–0.3 m) SOC using partial least squares regression (PLSR) and machine learning (i.e., Cubist, random forest [RF] and support vector machine [SVM]) in four sugarcane districts (i.e., Mossman, Lannercost, Herbert, and Proserpine). Four modelling approaches were compared (i.e., site-specific, site-independent, hold-out and spiking), and the effect of spike size was also considered. In all comparisons, a consistent set of calibration and validation data was used. The calibration coefficient of determination (R2) was always strong (> 0.7), and generally better than the validation R2, regardless of the modelling approach, district, or spike size. For the validation, Lin's concordance correlation coefficient (LCCC) showed that agreement for PLSR (0.92 and 0.9) and Cubist (0.91 and 0.9) was near perfect (> 0.9) for the site-specific and site-independent approaches, respectively. This was not the case for the hold-out approach, where a strong R2 (0.71) and substantial agreement (LCCC = 0.80) were achieved only in Herbert using Cubist, and results overall were moderate using PLSR. Accuracy, assessed using the ratio of performance to interquartile distance (RPIQ), followed a similar pattern: overall, the site-specific and site-independent approaches had excellent accuracy (> 2.5), with Cubist slightly more accurate than PLSR. Hold-out accuracy was generally very poor (< 1.4).
Spiking the hold-out data sets produced mixed results. Prediction R2, agreement and accuracy, respectively, were best in Lannercost with 70 or more samples using PLSR (strong, substantial and excellent) and in Herbert with 10 or more samples using Cubist (strong, near perfect and excellent). Results were not as good in Mossman with 50 or more samples using SVM (very weak, poor and fair) or in Proserpine with 30 or more samples using Cubist (weak, moderate and fair). It can be concluded that either the site-specific or the site-independent approach to calibration and prediction, using either PLSR or Cubist, was best. The site-independent approach is the more efficient of the two and allows the spectral library to be extended as new samples from each district, or from new districts, become available.
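
The three validation statistics reported above (R2, LCCC and RPIQ) can be illustrated with a minimal Python sketch. It is not the code used in the study. The function names and the example SOC values are hypothetical, NumPy is assumed, and R2 is computed here as 1 - SS_res/SS_tot, which may differ from the definition the authors applied (e.g., the squared correlation). LCCC follows Lin's standard formula with population (biased) variances, and RPIQ is taken as the interquartile distance of the observed values divided by the root mean square error.

```python
import numpy as np


def r2(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (an assumed definition)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    ss_res = np.sum((o - p) ** 2)
    ss_tot = np.sum((o - o.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def lccc(observed, predicted):
    """Lin's concordance correlation coefficient (agreement with the 1:1 line)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    # Population covariance and variances (ddof=0), as in Lin's formula.
    cov = np.mean((o - o.mean()) * (p - p.mean()))
    return 2.0 * cov / (o.var() + p.var() + (o.mean() - p.mean()) ** 2)


def rpiq(observed, predicted):
    """Ratio of performance to interquartile distance: (Q3 - Q1) / RMSE."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((o - p) ** 2))
    q1, q3 = np.percentile(o, [25, 75])
    return (q3 - q1) / rmse


if __name__ == "__main__":
    # Made-up SOC values (%) for illustration only; not data from the study.
    soc_observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    soc_predicted = [1.5, 2.5, 3.5, 4.5, 5.5]  # constant bias of +0.5
    print("R2   =", round(r2(soc_observed, soc_predicted), 3))
    print("LCCC =", round(lccc(soc_observed, soc_predicted), 3))
    print("RPIQ =", round(rpiq(soc_observed, soc_predicted), 3))
```

In this toy case the predictions are perfectly correlated with the observations but carry a constant bias, so LCCC falls below 1. This is why LCCC is reported alongside R2 as a measure of agreement rather than correlation alone.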