Predictive lithology mapping using semisupervised learning: Practical insights using a case study from New South Wales, Australia

Michael W Dunham, Alison Malcolm, J Kim Welford

doi:10.1190/geo2022-0476.1

Abstract

We develop a comprehensive study involving three different types of machine learning (unsupervised, supervised, and semisupervised, which we emphasize) for bedrock-lithology classification using a publicly available data set from New South Wales, Australia. The goal of this work is to demonstrate (1) the value each type of machine learning can provide and (2) which machine learning type(s) may be preferable under different circumstances. Training data are characteristically limited for geoscience problems, which makes supervised techniques susceptible to overfitting; we explore whether semisupervised methods can perform better in these circumstances. Using the geophysical data and geologic map provided for the study area, we compare the performance of two supervised methods (the Light Gradient Boosting Machine and eXtreme Gradient Boosting) with one semisupervised algorithm (label propagation [LP]) in three scenarios with different, limited a priori lithologic constraints (i.e., the training data). Hyperparameter tuning is an essential component of supervised and semisupervised techniques, and the default procedure is to choose the hyperparameter combination with the largest mean cross-validation score. However, we use a new hyperparameter selection strategy that simultaneously uses the mean and standard deviation of the cross-validation scores, and we test this new tactic for supervised and semisupervised methods. The results indicate (1) that the new hyperparameter selection technique can improve the performance of supervised and semisupervised methods slightly, by 1%–2%, compared with the standard selection approach and (2) that LP can outperform the two supervised methods by up to 10%, although this depends on how the training data are distributed. As for the unsupervised analysis, the clusters indicate heterogeneous regions that correlate well with the high-entropy areas in the supervised and semisupervised results. The clustering provides complementary results to the other two types of machine learning and is a source of supporting evidence for suggesting where more in-depth field mapping may be needed.
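
To illustrate how a hyperparameter selection rule that uses both the mean and the standard deviation of cross-validation scores can differ from the standard largest-mean rule, the following is a minimal sketch. The abstract does not state how the two statistics are combined, so the rule used here (mean minus a penalty times the standard deviation) is an assumption made for illustration only, not the authors' method. The function names, the `penalty` parameter, and the per-fold scores are all hypothetical.

```python
from statistics import mean, stdev


def select_by_mean(cv_scores):
    """Standard rule: pick the combination with the largest mean CV score."""
    return max(cv_scores, key=lambda name: mean(cv_scores[name]))


def select_by_mean_and_std(cv_scores, penalty=1.0):
    """ASSUMED rule (not taken from the paper): maximize mean - penalty * std,
    so that combinations with unstable fold-to-fold scores are penalized."""
    return max(
        cv_scores,
        key=lambda name: mean(cv_scores[name]) - penalty * stdev(cv_scores[name]),
    )


# Hypothetical per-fold accuracies for two hyperparameter combinations.
cv_scores = {
    "config_a": [0.90, 0.60, 0.92, 0.62, 0.91],  # slightly higher mean, unstable
    "config_b": [0.78, 0.77, 0.79, 0.78, 0.78],  # slightly lower mean, stable
}

best_mean = select_by_mean(cv_scores)
best_robust = select_by_mean_and_std(cv_scores)
print("largest mean:", best_mean)
print("mean and std:", best_robust)
```

In this made-up example the two rules disagree: the largest-mean rule favors the combination whose scores vary widely across folds, whereas the assumed combined rule favors the more stable one. Whether such a choice improves test performance depends on the data, as the abstract's reported 1%–2% gain suggests.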
