Machine learning (ML) based estimation of petrophysical properties from downhole geophysical logs has attracted significant attention across the geoscience community in recent years. Most studies in this area focus on identifying the most accurate ML algorithm for predicting petrophysical properties from logging data, often from a very limited number of boreholes (e.g. only one or two). Although a single borehole captures vertical variability, a limited number of boreholes cannot capture horizontal variability, so the effect of spatial variability on the performance of ML-based models needs specific attention. We have therefore developed an innovative workflow that assesses how spatial variation in the geophysical log data used for training affects the accuracy of ML models in predicting formation properties, e.g. compressional wave velocity. To build this systematic workflow, a large dataset containing geophysical logs from 50 boreholes in a coal mine in the Bowen Basin, QLD, Australia, was selected for analysis. The boreholes mainly intersected rocks of the Rangal Coal Measures. The data included density, compressional wave velocity (VP), gamma ray, resistivity and depth. The workflow consists of i) a clustering analysis using the K-means algorithm to determine whether the data could be partitioned into smaller, more homogeneous clusters based on density, gamma ray and measured depth; ii) setting aside five boreholes for validation (these test boreholes are never seen by the models); iii) selecting training datasets randomly from the remaining boreholes in groups of 1, 5, 10, 20, 30 and 40 boreholes, repeating the random selection for each group five times to capture the effect of random sampling; and iv) training a Least Squares Support Vector Regression model on 10, 20, 30, …, 90 and 100% of the data in each group, giving 300 predictive models.
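The count of 300 models follows from the sampling design: 6 group sizes × 5 random repeats × 10 data fractions. A minimal Python sketch of that design is given here for illustration only. The integer borehole identifiers, the random choice of test boreholes, the seed and the function name `build_design` are all assumptions and are not taken from the study.

```python
import itertools
import random

N_BOREHOLES = 50                          # boreholes in the dataset
N_TEST = 5                                # boreholes held out for testing
GROUP_SIZES = [1, 5, 10, 20, 30, 40]      # training boreholes per group
N_REPEATS = 5                             # random re-selections per group size
FRACTIONS = [f / 10 for f in range(1, 11)]  # 10%, 20%, ..., 100% of the data


def build_design(seed=0):
    """Enumerate one (group size, repeat, fraction) entry per model.

    Assumption: boreholes are labelled 0..49 and the test boreholes are
    drawn at random; the study may have chosen them differently.
    """
    rng = random.Random(seed)
    boreholes = list(range(N_BOREHOLES))
    test_ids = rng.sample(boreholes, N_TEST)
    pool = [b for b in boreholes if b not in test_ids]

    design = []
    for size, repeat in itertools.product(GROUP_SIZES, range(N_REPEATS)):
        # Draw the training boreholes once per (size, repeat) pair, then
        # reuse them for every data fraction.
        train_ids = rng.sample(pool, size)
        for fraction in FRACTIONS:
            design.append(
                {
                    "n_boreholes": size,
                    "repeat": repeat,
                    "fraction": fraction,
                    "train_ids": train_ids,
                }
            )
    return test_ids, design


test_ids, design = build_design()
print(len(design))
```

Each entry in `design` corresponds to one regression model to be trained; the training boreholes never overlap with the held-out test boreholes.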
The accuracy of the developed models in predicting VP from the other logs was evaluated by calculating the R² value between the estimated and actual VP values for the five test boreholes. The results suggest that a model should prioritize incorporating data from many boreholes over collecting more data points from a limited number of boreholes. This observation is mainly linked to the high lateral variability in the study area. Consequently, using a smaller portion of the data from a larger number of boreholes yields a simpler yet more accurate predictive model and reduces the computational cost.
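As an illustration of the evaluation metric, the sketch below computes R² as the coefficient of determination, 1 − SS_res / SS_tot. This definition is an assumption: the abstract does not state whether R² denotes the coefficient of determination or the squared correlation coefficient. The function name and the VP values are hypothetical.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical VP values in m/s, for illustration only.
vp_actual = [3000.0, 3200.0, 3400.0]
vp_predicted = [3050.0, 3200.0, 3350.0]
print(r2_score(vp_actual, vp_predicted))
```

Under this definition a perfect prediction scores 1, and a model that always predicts the mean of the measured values scores 0.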