A graded proportion method of training sample selection for updating conventional soil maps

Xueqi Liu, A-Xing Zhu, Lin Yang, Tao Pei, Junzhi Liu, Canying Zeng, Desheng Wang

doi:10.1016/j.geoderma.2019.113939

Abstract

Selecting training samples is a vital step in updating conventional soil maps with data mining models, because the quality of the training samples strongly affects the accuracy of the updated maps. The area-weighted proportion method is commonly used to generate training samples. However, when allocating sample sizes, this method usually assigns too little weight to soil types with small areas and too much weight to those with large areas. The resulting sample proportions are unreasonable and bias the representation of soil-environment relationships for those soil types. In addition, randomly selecting training samples within a soil type may produce ‘noise’ samples located in the transition areas between soil types. Both problems are likely to reduce the accuracy of the updated soil maps. In this study, a new method was developed that selects training samples by grading soil types according to their area coverage. The method consists of two steps. The first step determines the number of training samples for each soil type based on its grade, so as to maintain reasonable proportions of samples among soil types with different area coverages. The second step selects typical (representative) samples for each soil type from the conventional soil map, to avoid generating ‘noise’ samples. The proposed method was compared with three other training sample selection methods at four training sample sizes. Each method was run 100 times at each sample size to generate training sample datasets for evaluating effectiveness and stability. Random forest was used to generate updated soil maps for a small watershed in Raffelson, Wisconsin (USA). The validation results showed that the graded proportion method effectively resolved the imbalance in training samples among soil types with very different area coverages, an imbalance caused by the area-weighted proportion strategy. As a result, training samples generated with the proposed method usually yielded more accurate and reasonable mapping results than those generated with the area-weighted proportion strategy. Furthermore, the performance of the proposed method was more stable than that of the area-weighted proportion strategy as the training sample size increased. It is concluded that the proposed method is an effective training sample selection method for updating conventional soil maps with data mining models.
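
To make the imbalance described above concrete, the following minimal Python sketch shows how an area-weighted proportion allocation might be computed. It covers only the baseline strategy; the grading rules of the proposed method are not given in the abstract and are not reproduced here. The function name, the largest-remainder rounding, and the soil type names and areas are all assumptions made for illustration, not values or procedures from the study.

```python
def area_weighted_allocation(areas, total_samples):
    """Allocate training samples to soil types in proportion to mapped area.

    `areas` maps a soil type name to its area; units cancel out.
    Rounding uses the largest-remainder rule (an assumption of this sketch).
    """
    total_area = sum(areas.values())
    # Ideal fractional share of the sample budget for each soil type.
    quotas = {name: total_samples * area / total_area
              for name, area in areas.items()}
    # Start from the integer part of each quota.
    counts = {name: int(quota) for name, quota in quotas.items()}
    # Hand out the samples lost to truncation, largest remainder first.
    leftover = total_samples - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - counts[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical areas (hectares), for illustration only; not data from the study.
example_areas = {"type_A": 600.0, "type_B": 300.0, "type_C": 90.0, "type_D": 10.0}

large_budget = area_weighted_allocation(example_areas, 100)
small_budget = area_weighted_allocation(example_areas, 10)
```

With these hypothetical areas, a budget of 100 samples leaves the smallest soil type with a single training sample, and a budget of 10 leaves it with none, which is the kind of under-representation the graded proportion method is intended to avoid.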

Full Text