Deep learning methods are increasingly used in seismic applications, but the black-box nature of neural networks undermines the confidence users can place in their outputs. Moreover, conventional neural networks are not probabilistic. In velocity model building, neural networks predict a velocity value whether they are confident in the result or not. The absence of confidence information is problematic when there are differences between the features of the training and the target data (e.g., noise sources), which is the norm rather than the exception. We propose to recast velocity regression as a classification problem by binning the velocities and making neural networks output a confidence in each of the binned velocities rather than predicting a single, scalar velocity. To do so, we modify only the tail end of neural networks and turn the output scalars into softmax distributions. We take the median of each distribution as the end result and we interpret the standard deviation of a distribution as a mistrust metric. We compare the performance of regression and classification, and we observe that classification is more accurate: the RMSE between the target and predicted models for classification (181 m/s) is lower than for regression (213 m/s). Classification also provides results that are less overconfident than those of regression under an ensemble scheme. Indeed, the average standard deviation, which indicates the precision of the results, is 387 m/s for classification, whereas it is 223 m/s for regression. Moreover, classification leads to predictions that are within 1 standard deviation of the ground truth 77.4% of the time, whereas regression does so only 49.9% of the time. Finally, we confirm the relevance of the mistrust metric by observing the impact of 2D structures on the confidence of networks trained only on 1D layered models.
Our findings indicate that a single classification neural network may be used instead of an ensemble of regression neural networks, and that classification would use fewer resources (8 times fewer GPU hours when the size of the ensemble is 16) and be easier to interpret.
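The binning, softmax, median, and standard-deviation steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the velocity range (1500 to 3500 m/s), the number of bins (4), and the use of equal-width bins are all hypothetical choices made for illustration only.

```python
import numpy as np

def bin_centers(v_min, v_max, n_bins):
    """Centers of equal-width velocity bins (bin layout is an assumption)."""
    edges = np.linspace(v_min, v_max, n_bins + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def velocity_to_bin(velocity, v_min, v_max, n_bins):
    """Map a velocity in m/s to a class index, i.e. the classification label."""
    edges = np.linspace(v_min, v_max, n_bins + 1)
    idx = np.digitize(velocity, edges) - 1
    # Values on or beyond the outer edges fall into the first or last bin.
    return np.clip(idx, 0, n_bins - 1)

def softmax(logits, axis=-1):
    """Numerically stable softmax over the bin axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def summarize(probs, centers):
    """Median (prediction) and standard deviation (mistrust) per distribution."""
    # Median: first bin where the cumulative probability reaches 0.5.
    cdf = np.cumsum(probs, axis=-1)
    median = centers[np.argmax(cdf >= 0.5, axis=-1)]
    # Standard deviation of the discrete distribution around its mean.
    mean = (probs * centers).sum(axis=-1, keepdims=True)
    std = np.sqrt((probs * (centers - mean) ** 2).sum(axis=-1))
    return median, std

# Hypothetical example: raw network outputs (logits) for one depth sample.
centers = bin_centers(1500.0, 3500.0, 4)
logits = np.array([0.1, 2.0, 1.5, -1.0])
median, std = summarize(softmax(logits), centers)
print(f"predicted velocity: {median:.0f} m/s, mistrust: {std:.0f} m/s")
```

In this sketch, a sharply peaked distribution yields a small standard deviation, while a flat distribution yields a large one, which is the behavior the mistrust metric relies on.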