Modelling landslide susceptibility prediction: A review and construction of semi-supervised imbalanced theory

Faming Huang, Haowen Xiong, Shui-Hua Jiang, Chi Yao, Xuanmei Fan, Filippo Catani, Zhilu Chang, Xiaoting Zhou, Jinsong Huang, Keji Liu

doi:10.1016/j.earscirev.2024.104700

Abstract

Fully supervised machine learning models are widely applied for landslide susceptibility prediction (LSP), mainly using landslide and non-landslide samples as output variables and related conditioning factors as input variables. However, LSP modelling involves many sources of uncertainty: known landslide samples may contain errors; non-landslide samples randomly selected from the whole study area are not accurate; a 1:1 ratio of landslide to non-landslide samples is inconsistent with the actual landslide distribution characteristics; assigning samples labelled non-landslide a probability of 0 is unreasonable; and a comprehensive assessment of LSP performance is difficult to achieve. Based on a review of the literature, we propose a semi-supervised imbalanced theory to overcome these issues. First, a supervised machine learning model is constructed from landslide samples (occurrence probability assigned 1), randomly selected non-landslide samples (occurrence probability assigned 0), slope units delineated by the multi-scale segmentation method, and the related conditioning factors. This model is used to predict the initial landslide susceptibility indexes (LSIs), which are then classified into very low, low, moderate, high and very high landslide susceptibility levels (LSLs). Second, the landslide samples whose LSLs are classified as very low are removed to reduce errors in the landslide samples, and non-landslide samples are randomly selected from the low and very low LSL groups to ensure the accuracy of the non-landslide samples. We refer to this type of sample selection as a semi-supervised learning strategy. Third, the sampling ratio of landslide to non-landslide samples is successively set to values from 1:1 to 1:200, the initial LSIs are assigned as the labels of the corresponding non-landslide samples, and the labels of landslide samples are still assigned the value 1. We call this process the imbalanced sampling strategy.
Fourth, we use the labelled landslide and non-landslide samples to train and test the supervised machine learning model again. Finally, the optimal ratio of landslide to non-landslide samples is determined, and the final LSP results obtained, by comparing LSP accuracy and LSI distribution characteristics under different sampling ratios. Jiujiang City in Jiangxi Province, China, is the study area. The results show that the ROC and prediction rate accuracies of the semi-supervised imbalanced random forest (RF) model gradually increase from 0.979 and 0.853 to 0.990 and 0.912, respectively, as the imbalanced ratio rises from 1:1 to 1:160. Both accuracies then tend to converge as the ratio rises from 1:160 to 1:200. Hence, the LSP results of the semi-supervised imbalanced theory are efficient when the ratio of landslides to non-landslides is 1:160. We conclude that the proposed theory significantly improves the theoretical basis of LSP modelling.
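
The second and third steps (sample selection and imbalanced labelling) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the initial LSIs from the first step are already available, and it uses equal-interval breaks (0.2, 0.4, 0.6, 0.8) to define the five LSLs, which is an assumption made here because the abstract does not state the classification method. The names `susceptibility_level` and `build_training_set` and the tuple layout `(unit_id, initial_lsi, is_landslide)` are hypothetical.

```python
import random

LEVELS = ["very low", "low", "moderate", "high", "very high"]


def susceptibility_level(lsi, breaks=(0.2, 0.4, 0.6, 0.8)):
    """Map an LSI in [0, 1] to one of five LSLs.

    Equal-interval breaks are an assumption for illustration only.
    """
    for level, upper in zip(LEVELS, breaks):
        if lsi < upper:
            return level
    return LEVELS[-1]


def build_training_set(units, ratio, seed=0):
    """Select and label samples for one landslide:non-landslide ratio of 1:ratio.

    units: list of (unit_id, initial_lsi, is_landslide) tuples (hypothetical layout).
    Returns a list of (unit_id, label) pairs.
    """
    rng = random.Random(seed)

    # Step 2a: drop landslide samples whose initial LSL is "very low".
    landslides = [
        u for u in units
        if u[2] and susceptibility_level(u[1]) != "very low"
    ]

    # Step 2b: non-landslide candidates come only from the low and very low LSLs.
    candidates = [
        u for u in units
        if not u[2] and susceptibility_level(u[1]) in ("very low", "low")
    ]

    # Step 3: sample non-landslides at 1:ratio, capped by the candidates available.
    n_non = min(len(candidates), ratio * len(landslides))
    non_landslides = rng.sample(candidates, n_non)

    # Landslides keep label 1; non-landslides are labelled with their initial LSI.
    labelled = [(uid, 1.0) for uid, _, _ in landslides]
    labelled += [(uid, lsi) for uid, lsi, _ in non_landslides]
    return labelled
```

Calling `build_training_set` once per ratio from 1 to 200 would produce the training sets compared in the final step. Retraining the model on these continuous labels is not shown here.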

Full Text