Digital mapping of WRB soil classes using linear and non-linear classification-based machine learning algorithms and integration of confusion index in knowledge discovery

Fuat Kaya, Levent Başayiğit

doi:10.5194/egusphere-egu23-510

Abstract

With Multinomial Logistic Regression (MNLR), a simple linear classification algorithm, the probability of each pixel belonging to each class can be calculated, and the most probable class can be compared with ground truth in digital soil mapping. With the Random Forest (RF) algorithm, a relatively complex classifier that can also capture non-linear relationships, the most probable class is determined from the proportion of votes for each class, which RF treats as estimates of the class probabilities in each pixel. In the current study, we used a map with eight FAO-WRB second-level soil classes, produced by detailed soil survey mapping over an area of approximately 10,000 ha. We selected the data set points from the mapping units using an area-weighted sampling methodology. Digital soil class maps were generated using the two classification algorithms and twenty-three variables representing parent material, organisms and topography, derived from the digital elevation model and Landsat 7 ETM satellite images. Classification accuracy was measured using the confusion matrix. Overall accuracy in the training and validation sets was 52% and 48% for MNLR, and 48% and 55% for RF, respectively. In general, machine learning algorithms try to minimize the misclassification error, and thus errors in all classes are treated as equally important. In soil science, however, the probabilities of the most probable and the second most probable classes produced by these two algorithms are also important. We therefore calculated the confusion index (CI), which is based on the probability values of the most probable class and the second most probable class, for the training and validation sets of each classification algorithm. Mean CI values in the training and validation sets were 0.73 and 0.75 for MNLR, and 0.36 and 0.77 for RF, respectively.
As the CI approaches 0, it indicates that a pixel strongly belongs to the class to which it is allocated. According to the confusion matrix results, there is no large difference between the two models in the training and validation sets. In terms of the confusion index, however, there is a 50% difference between the mean values of the training and validation sets for the RF algorithm. Because the CI maps are produced from the model established with the training set, visual interpretation and pedologist knowledge should be integrated. Both classification algorithms failed to digitally map the Chromic Cambisol class in the study area. This soil class is identified in the field from a subsurface chroma value and was difficult to capture with our environmental covariate set. We suggest that, in addition to reporting overall accuracy values when producing any digital soil class map, confusion index values be calculated and interpreted together with pedological information.
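
The abstract does not state the CI formula. As a minimal sketch, assuming the commonly used definition CI = 1 - (p1 - p2), where p1 and p2 are the highest and second-highest class probabilities in a pixel, the index could be computed as follows. The function names and the example probability vectors are hypothetical and are not taken from the study.

```python
def confusion_index(probs):
    """Confusion index for one pixel, ASSUMING CI = 1 - (p1 - p2).

    probs: class probabilities for the pixel (MNLR probabilities or
    RF vote proportions). Values near 0 mean the top class dominates.
    """
    if len(probs) < 2:
        raise ValueError("need at least two class probabilities")
    # Take the two largest probabilities, in descending order.
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def mean_confusion_index(rows):
    """Mean CI over a collection of per-pixel probability vectors."""
    values = [confusion_index(r) for r in rows]
    return sum(values) / len(values)


# Hypothetical pixels with three classes each (illustrative values only).
confident = [0.90, 0.07, 0.03]   # top class clearly dominates
ambiguous = [0.40, 0.38, 0.22]   # top two classes nearly tied

ci_confident = confusion_index(confident)
ci_ambiguous = confusion_index(ambiguous)
ci_mean = mean_confusion_index([confident, ambiguous])
```

Under this assumed definition, the first pixel yields a CI close to 0 and the second a CI close to 1, which matches the interpretation given above: a low CI means the allocated class is well separated from the runner-up.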
