Oil-source correlation identifies the origin of crude oil by linking it to its source rocks; however, manual methods are limited by the number of samples and parameters and by imbalanced datasets, which introduces uncertainty. Existing multivariate statistical techniques alleviate this problem, but they still struggle to process imbalanced datasets and to quantify source-bed contributions. This paper therefore proposes a novel oil-source correlation model, SVM-SelectKBest, which combines a support vector machine (SVM) with a feature selection algorithm to mitigate the dataset imbalance common in oil-source correlation. SVM-SelectKBest offers advantages over a standard SVM by dynamically selecting the most relevant features and fine-tuning model parameters, achieving higher accuracy and better generalizability on complex datasets. The SVM compensates for class imbalance by heavily penalizing misclassification of the minority class, and SelectKBest streamlines the feature set so that the SVM focuses on the critical variables. Furthermore, a shallow neural network (SensoryAttentionNet) is introduced into the proposed model to quantify the proportions of source beds in crude oil. The results show that SVM-SelectKBest performs better at identifying key geochemical parameters and discriminating oil-source correlations; its accuracy on imbalanced datasets is nearly 40% higher than that of SVM. The model identifies 25 key geochemical parameters, such as n-heptadecane (C17), pristane (Pr), and n-octadecane (C18), and achieves F1 scores of 1.0 on the training, validation, and test sets. SensoryAttentionNet also performs robustly, with a low variance of 0.05 between its predicted and actual values.
These results indicate that the proposed method is effective both in dealing with the imbalance problem in oil-source correlation datasets and in determining the proportional contribution of source beds to crude oil.
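
The combination of univariate feature selection with a class-weighted SVM can be sketched with scikit-learn as follows. This is a minimal illustration and not the authors' implementation: the synthetic data, the `f_classif` scoring function, the RBF kernel, the value `k=5`, and the helper names `make_synthetic_oil_data` and `build_model` are all assumptions, since the abstract does not specify them. The `class_weight="balanced"` option is one common way to penalize minority-class errors more heavily.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_synthetic_oil_data(n_major=180, n_minor=20, n_features=40,
                            n_informative=5, seed=0):
    """Hypothetical stand-in for a geochemical table: rows are oil samples,
    columns are parameters, labels are two source beds (imbalanced 9:1)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_major + n_minor, n_features))
    y = np.array([0] * n_major + [1] * n_minor)
    # Only the first n_informative columns differ between the two classes.
    X[y == 1, :n_informative] += 3.0
    return X, y


def build_model(k=5):
    """Scale, keep the k best-scoring features, then fit a weighted SVM."""
    return Pipeline([
        ("scale", StandardScaler()),
        # ANOVA F-score is an assumed scoring function.
        ("select", SelectKBest(score_func=f_classif, k=k)),
        # 'balanced' weights errors inversely to class frequency,
        # so minority-class mistakes cost more.
        ("svm", SVC(kernel="rbf", class_weight="balanced")),
    ])


X, y = make_synthetic_oil_data()
# Stratify so the minority class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = build_model(k=5)
model.fit(X_train, y_train)

selected = model.named_steps["select"].get_support(indices=True)
macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")

print("selected feature indices:", selected)
print("macro F1 on held-out split:", round(macro_f1, 3))
```

Placing the selector inside the pipeline means feature scores are computed on the training split only, which avoids leaking test information into the selection step. In practice `k` and the SVM hyperparameters would be tuned together, for example by cross-validated grid search over the pipeline.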