Machine-learning techniques allow geoscientists to extract meaningful information from data in an automated fashion, and they offer an efficient alternative to traditional manual interpretation. Many geophysical problems feature an abundance of unlabeled data and a paucity of labeled data, and lithology classification from wireline data reflects this situation. Training supervised algorithms on small labeled data sets can lead to overfitting, and the subsequent predictions for the numerous unlabeled data may be unstable. Semisupervised algorithms, however, are designed for classification problems with limited amounts of labeled data, and in these situations they can, in theory, achieve better accuracy than supervised algorithms. We explore this hypothesis by applying two semisupervised techniques, label propagation (LP) and self-training, to a well-log data set and comparing their performance with that of three popular supervised algorithms. LP is an established method, whereas our self-training method is a unique adaptation of existing implementations. The well-log data were made public through an SEG competition held in 2016. We simulate a semisupervised scenario with these data by assuming that only one of the 10 wells has labels (i.e., core samples), and our objective is to predict the labels for the remaining nine wells. We generate results in two stages: in the first, we apply all of the algorithms to the data as is (i.e., the global data set), and the results motivate the second stage, in which we apply all of the algorithms to the data decomposed into two separate data sets. Overall, our findings suggest that LP alone does not outperform the supervised methods, but our self-training method coupled with LP can outperform them by a notable margin when the assumptions of LP are met.
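The labeling scheme the abstract describes, one labeled well among ten, maps onto the standard semisupervised convention of flagging unlabeled samples with -1. The sketch below illustrates that setup using scikit-learn's generic LabelPropagation and SelfTrainingClassifier; it is a minimal illustration with synthetic placeholder logs, not the paper's pipeline, and in particular the paper's self-training method is a custom adaptation rather than the stock scikit-learn class shown here.

```python
# Minimal sketch of the one-labeled-well scenario (assumed setup, not the
# authors' exact pipeline). Features and labels are random placeholders
# standing in for the SEG 2016 well-log curves and core-derived facies.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_per_well, n_wells, n_facies = 200, 10, 9
X = rng.normal(size=(n_per_well * n_wells, 5))    # stand-in for log curves
y_true = rng.integers(0, n_facies, size=len(X))   # stand-in for facies labels

# Semisupervised convention: keep labels only for well 0,
# mark the remaining nine wells as unlabeled with -1.
y = np.full(len(X), -1)
y[:n_per_well] = y_true[:n_per_well]

# LP builds a similarity graph over all samples, so features should be scaled.
X = StandardScaler().fit_transform(X)

lp = LabelPropagation(kernel="rbf", gamma=20).fit(X, y)
facies_lp = lp.transduction_                      # propagated labels, all wells

# Generic self-training baseline: iteratively pseudo-label confident samples.
st = SelfTrainingClassifier(RandomForestClassifier(n_estimators=100),
                            threshold=0.75)
st.fit(X, y)
facies_st = st.predict(X[n_per_well:])            # predictions for nine wells
```

A two-stage comparison like the one in the abstract would then score `facies_lp` and `facies_st` against held-out core labels, first on the global data set and then on each decomposed subset.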