Abstract

The present study examined how soil color can help build and evaluate clay content prediction models from laboratory visible and near-infrared spectroscopic data. The study was based on a regional database of 449 soil samples collected over Karnataka state in India, which was divided into red soils (240 samples) and black soils (209 samples) according to their Munsell soil color. Partial least squares regression models were calibrated and validated on both the regional dataset and the subsets stratified as red and black soils. In addition, a random forest model was used to classify the validation soil samples into black and red classes in order to evaluate the models' performance. First, while the clay content predicted by the regression model built from regional data was acceptable at the regional scale (validation R² of 0.75), this model was more accurate over black soil samples (validation R² of 0.8) than over red soil samples (validation R² of 0.63). Second, the regression models built from subsets stratified by soil color performed differently from the regression model built from the regional data, both at the regional scale and at the soil color scale. In conclusion, this study demonstrated that (1) predictions are highly dependent on calibration data, (2) the interpretation of prediction performance relies heavily on validation data, and (3) pedological knowledge, such as soil color, can serve as a useful covariate in both the construction and the evaluation of regression models.
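The point that a single regional score can hide differences between soil color classes can be illustrated with a minimal sketch. It is not the study's code or data: the clay values, the class labels, and the helper names `r2` and `r2_by_class` are all hypothetical, and R² is assumed here to be the coefficient of determination, 1 - SS_res / SS_tot, which may differ from the definition used in the study.

```python
from collections import defaultdict


def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def r2_by_class(observed, predicted, classes):
    """Return R2 over all validation samples and separately per class label."""
    groups = defaultdict(lambda: ([], []))
    for o, p, c in zip(observed, predicted, classes):
        groups[c][0].append(o)
        groups[c][1].append(p)
    scores = {"all": r2(observed, predicted)}
    for label, (obs, pred) in groups.items():
        scores[label] = r2(obs, pred)
    return scores


# Hypothetical validation set: clay content in %, with a soil color label
# per sample. These numbers are made up for illustration only.
observed = [40.0, 50.0, 60.0, 20.0, 30.0, 40.0]
predicted = [42.0, 50.0, 58.0, 26.0, 30.0, 34.0]
classes = ["black", "black", "black", "red", "red", "red"]

scores = r2_by_class(observed, predicted, classes)
for label, value in scores.items():
    print(f"{label}: R2 = {value:.2f}")
```

In this toy example the pooled score is well above the score for the red class alone, because pooling the two classes widens the range of observed clay contents. This is one way the same model can look different depending on which validation samples are used to judge it.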