Systematic sample subdividing strategy for training landslide susceptibility models

Maher Ibrahim Sameen, Biswajeet Pradhan, Dieu Tien Bui, Abdullah M Alamri

doi:10.1016/j.catena.2019.104358

Abstract

Current practice in choosing training samples for landslide susceptibility modelling (LSM) is to randomly subdivide inventory information into training and testing samples. Where inventory data differ in distribution, selecting training samples by a random process may lead to inefficient training of machine learning (ML)/statistical models. A systematic technique may, however, produce efficient training samples that represent the entire inventory data well. This is particularly true when inventory information is scarce. This research proposed a systematic strategy to deal with this problem based on a distance between probability distributions (i.e. the Hellinger distance) and a novel graphical representation of the information contained in inventory data (i.e. the inventory information curve, IIC). This graphical representation illustrates the relative increase in available information as the training sample size grows. Experiments on a selected dataset over the Cameron Highlands, Malaysia, were conducted to validate the proposed methods. The dataset contained 104 landslide inventories and 7 landslide-conditioning factors (i.e. altitude, slope, aspect, land use, distance from the stream, distance from the road and distance from lineament) derived from a LiDAR-based digital elevation model and thematic maps acquired from government authorities. In addition, three ML/statistical models, namely, k-nearest neighbour (KNN), support vector machine (SVM) and decision tree (DT), were utilised to assess the proposed sampling strategy for LSM. The impacts of model hyperparameters, noise and outliers on the performance of the models and on the shape of the IICs were also investigated and discussed. To evaluate the proposed method further, it was compared with other standard methods such as random sampling (RS), stratified RS (SRS) and cross-validation (CV). The evaluations were based on the area under the receiver operating characteristic curve.
The results show that IICs are useful in explaining the information content of a training subset and how it differs from that of the original inventory dataset. The quantitative evaluation with KNN, SVM and DT shows that the proposed method outperforms RS and SRS for all three models and outperforms CV for the KNN and DT models. The proposed sampling strategy enables new applications in landslide modelling, such as measuring the content and complexity of inventory data and selecting effective training samples to improve the predictive capability of landslide susceptibility models.
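
As an illustration of the distance measure named in the abstract, the following minimal Python sketch computes the Hellinger distance between two discrete distributions and tracks how the distance between a growing random subset and the full sample shrinks. It is not the authors' IIC construction, whose exact definition is not given in the abstract. The helper `subset_distance_curve`, the histogram binning, the random ordering and the synthetic slope values are all assumptions made for illustration only.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Inputs may be raw counts; both are normalised to sum to 1.
    Returns 0 for identical distributions and 1 for disjoint ones.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def subset_distance_curve(values, sizes, bins=10, seed=0):
    """Hypothetical helper: for each subset size (must be >= 1), return the
    Hellinger distance between the histogram of a random subset and the
    histogram of the full sample, using shared bin edges."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    edges = np.histogram_bin_edges(values, bins=bins)
    full_hist, _ = np.histogram(values, bins=edges)
    order = rng.permutation(len(values))  # one fixed random ordering
    curve = []
    for n in sizes:
        sub_hist, _ = np.histogram(values[order[:n]], bins=edges)
        curve.append(hellinger(sub_hist, full_hist))
    return curve

# Synthetic stand-in for one conditioning factor (e.g. slope in degrees)
# at 104 inventory points; these numbers are made up, not the paper's data.
slope = np.random.default_rng(42).gamma(shape=4.0, scale=6.0, size=104)
sizes = [10, 30, 50, 70, 90, 104]
for n, d in zip(sizes, subset_distance_curve(slope, sizes)):
    print(f"subset size {n:3d}: Hellinger distance to full sample = {d:.3f}")
```

A distance near 0 indicates that the subset reproduces the distribution of the full sample for that factor; when the subset is the whole sample, the distance is exactly 0.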

Full Text