Abstract
Random Forest (RF) is a widely used machine learning algorithm for crop type mapping. RF’s variable importance aids in dimension reduction and identifying relevant multisource hyperspectral data. In this study, we examined spatial effects in a sequential backward feature elimination setting using RF variable importance in the example of a large-scale irrigation system in Punjab, Pakistan. We generated a reference classification with RF applied to 122 SAR and optical features from time series data of Sentinel‑1 and Sentinel‑2, respectively. We ranked features based on variable importance and iteratively repeated the classification by excluding the least important feature, assessing its agreement with the reference classification. McNemar’s test identified the critical point where feature reduction significantly affected the RF model’s predictions. Additionally, spatial assessment metrics were monitored at the pixel level, including spatial confidence (number of classifications agreeing with the reference map) and spatial instability (number of classes occurring during feature reduction). This process was repeated 10 times with ten distinct stratified random sampling splits, which showed similar variable rankings and critical points. In particular, VH SAR data was selected when cloud-free optical observations were unavailable. Omitting 80% of the features resulted in an insignificant loss of only 2% overall accuracy, while spatial confidence decreased by 5%. Moreover, the crop map at the critical point exhibited an increase in spatial instability from a single crop to 1.28. McNemar’s test and the spatial assessment metrics are recommended for optimized feature reduction benchmarks and identifying areas requiring additional ground data to improve the results.
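The two assessment steps named above, McNemar’s test and the pixel-level spatial metrics, can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names `mcnemar_test` and `spatial_metrics` are hypothetical, the continuity-corrected form of McNemar’s statistic is an assumption, maps are assumed to be flattened into equal-length lists of class labels, and counting the reference map's class towards instability is likewise an assumption.

```python
import math


def mcnemar_test(y_true, pred_reference, pred_reduced):
    """Continuity-corrected McNemar's test on two sets of predictions.

    Assumption: discordant pairs are counted against ground-truth labels.
    Returns (chi-square statistic, p-value) with one degree of freedom.
    """
    # b: reference model correct, reduced model wrong
    b = sum(1 for t, r, d in zip(y_true, pred_reference, pred_reduced)
            if r == t and d != t)
    # c: reference model wrong, reduced model correct
    c = sum(1 for t, r, d in zip(y_true, pred_reference, pred_reduced)
            if r != t and d == t)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def spatial_metrics(reference_map, reduced_maps):
    """Per-pixel spatial confidence and spatial instability.

    confidence[i]:  number of reduced-feature maps that agree with the
                    reference map at pixel i.
    instability[i]: number of distinct classes seen at pixel i across the
                    reference map and all reduced-feature maps
                    (1 means the pixel never changed class).
    """
    confidence = []
    instability = []
    for i, ref_class in enumerate(reference_map):
        classes_here = [m[i] for m in reduced_maps]
        confidence.append(sum(1 for cls in classes_here if cls == ref_class))
        instability.append(len({ref_class, *classes_here}))
    return confidence, instability


if __name__ == "__main__":
    # Toy labels, purely illustrative
    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    pred_reference = [0, 0, 1, 1, 1, 0, 1, 1]
    pred_reduced = [0, 1, 1, 0, 1, 0, 0, 1]
    stat, p = mcnemar_test(y_true, pred_reference, pred_reduced)
    print(f"McNemar statistic={stat:.3f}, p-value={p:.3f}")

    reference_map = [1, 1, 2, 3]
    reduced_maps = [[1, 1, 2, 3], [1, 2, 2, 3], [1, 2, 2, 1]]
    conf, inst = spatial_metrics(reference_map, reduced_maps)
    print("spatial confidence:", conf)
    print("mean spatial instability:", sum(inst) / len(inst))
```

In a backward elimination loop, a test like this would be applied after each feature is dropped, and the first step at which the p-value falls below a chosen significance level (for example 0.05) would mark the critical point. Averaging the per-pixel instability over the map gives a single summary value comparable in spirit to the 1.28 reported above.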