In cardiovascular disease (CVD) risk assessment, classification error caused by imbalanced data is a significant challenge that has attracted wide attention in the data mining community. To address the imbalance of CVD data sets, this paper proposes an oversampling method based on adaptive double weights and a Gaussian kernel function (ADWGKFO), which converts imbalanced data sets into balanced ones. First, a clustering algorithm groups the minority samples; boundary samples are identified using the Borderline-Synthetic Minority Over-sampling Technique (Borderline-SMOTE), K-nearest neighbor, and support vector machine algorithms; and the number of samples to synthesize in each cluster is calculated from the double weights of boundary points and majority distribution. Second, to define the classification boundary clearly, the mutual class potential of the new samples in each cluster is calculated with a Gaussian kernel function, and the new samples are filtered according to this potential until the data set is balanced. Finally, the proposed method is analyzed empirically on data sets from the Kaggle platform. To validate its efficacy and generality, CatBoost, a recent ensemble algorithm, is used to test the effect of ADWGKFO, and the result is compared with other sampling methods and other classifiers using accuracy, F1-score, and area under the curve (AUC) as performance metrics. Compared with the other combinations of methods, accuracy, F1-score, and AUC are significantly improved. It is concluded that the ADWGKFO method can improve data quality and increase the reliability of CVD risk assessment.
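The filtering step can be illustrated with a minimal sketch. The abstract does not give the formula for the mutual class potential, so the definition used here (Gaussian-kernel potential of the majority class minus that of the minority class) is an assumption, as are the function names, the kernel width `gamma`, the `threshold`, and the toy data. It is not the authors' implementation.

```python
import numpy as np

def class_potential(x, samples, gamma=1.0):
    """Sum of Gaussian kernels centred on `samples`, evaluated at point x."""
    sq_dist = np.sum((samples - x) ** 2, axis=1)
    return float(np.sum(np.exp(-sq_dist / gamma ** 2)))

def mutual_class_potential(x, majority, minority, gamma=1.0):
    """Assumed definition: majority potential minus minority potential."""
    return (class_potential(x, majority, gamma)
            - class_potential(x, minority, gamma))

def filter_candidates(candidates, majority, minority, gamma=1.0, threshold=0.0):
    """Keep synthetic candidates whose mutual class potential is <= threshold.

    A low score means the candidate lies in a region dominated by the
    minority class, i.e. away from the majority side of the boundary.
    """
    scores = np.array([
        mutual_class_potential(c, majority, minority, gamma)
        for c in candidates
    ])
    return candidates[scores <= threshold], scores

# Toy data (hypothetical): majority near the origin, minority near (2, 2).
majority = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]])
minority = np.array([[2.0, 2.0], [2.1, 1.9]])
candidates = np.array([[1.9, 2.0], [0.1, 0.0]])

kept, scores = filter_candidates(candidates, majority, minority)
```

In this toy case the candidate near the minority cluster has a negative score and is kept, while the candidate inside the majority region has a positive score and is discarded. In a full pipeline, candidate generation and filtering would presumably repeat until the class counts match.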