DBCSMOTE: a clustering-based oversampling technique for data-imbalanced warfarin dose prediction

Yanyun Tao, Yuzhen Zhang, Bin Jiang

doi:10.1186/s12920-020-00781-2

Abstract

Background: Vitamin K antagonist (warfarin) is the most classical and widely used oral anticoagulant, with a reliable anticoagulant effect, wide clinical indications and a low price. Warfarin dosage requirements vary widely between patients. In warfarin daily dosage prediction, data imbalance in the dataset leads to inaccurate predictions for patients with rare genotypes, who usually have a large stable dosage requirement. To balance the dataset of patients treated with warfarin and improve predictive accuracy, an appropriate partition into majority and minority groups, together with an oversampling method, is required.

Method: To solve this data-imbalance problem, we developed a clustering-based oversampling technique, denoted DBCSMOTE, which combines density-based spatial clustering of applications with noise (DBSCAN) and the synthetic minority oversampling technique (SMOTE). DBCSMOTE automatically finds the minority groups by capturing the association between samples in terms of clinical features/genotypes and warfarin dosage, and creates an extended dataset by adding new synthetic samples of the majority and minority groups. Two ensemble models, boosted regression tree (BRT) and random forest (RF), are built on the extended dataset generated by DBCSMOTE and perform the warfarin daily dosage prediction.

Results: DBCSMOTE and the comparison methods were tested on datasets derived from our hospital and the International Warfarin Pharmacogenetics Consortium (IWPC). DBCSMOTE-BRT obtained the highest R-squared (R2) of 0.424 and the smallest mean squared error (mse) of 1.08. In terms of the percentage of patients whose predicted dose of warfarin is within 20% of the actual stable therapeutic dose (20%-p), DBCSMOTE-BRT achieved the largest value, 47.8%, among the predictive models. More importantly, DBCSMOTE saved about 68% of the computational time while achieving the same or better performance than Evolutionary SMOTE, which was the best oversampling method for warfarin dose prediction so far. DBCSMOTE was also found to be more effective when integrated with BRT than with RF.

Conclusion: The genotypes CYP2C9 and VKORC1 clearly contribute to predictive accuracy. Including left atrium diameter, glutamic pyruvic transaminase and serum creatinine in the model also improved predictive accuracy; when congestive heart failure, diabetes mellitus and valve replacement were absent from DBCSMOTE-BRT/RF, its predictive accuracy decreased. The oversampling ratio and the number of minority clusters have a large impact on the effect of oversampling. In our tests, predictive accuracy was high when the number of minority clusters was 6 to 8. The oversampling ratio should be large (> 1.2) for small minority clusters and small (< 0.2) for large minority clusters. If the dataset becomes larger, DBCSMOTE should be re-optimized and its BRT/RF model re-trained. DBCSMOTE-BRT/RF outperformed the commonly used tool Warfarindosing. Compared with the Evolutionary SMOTE-BRT and RF models, the DBCSMOTE-BRT and RF models need much less computational time to achieve the same or higher performance in many cases. In terms of predictive accuracy, RF is not as good as BRT; however, RF can still generate a highly accurate model as the dataset grows. The software "WarfarinSeer v2.0" is a test version that packages DBCSMOTE-BRT/RF, and it could be a convenient tool for clinical application in warfarin treatment.
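As a rough illustration of the oversampling step described in the abstract, the following Python sketch generates synthetic samples inside one already-identified cluster by SMOTE-style interpolation between a sample and one of its nearest neighbours in the same cluster. It is not the authors' implementation: the cluster is assumed to come from a prior DBSCAN run, each sample is assumed to be a (feature, dose) tuple so that the dose is interpolated along with the features, and the function name `oversample_cluster`, the parameter `k` and the toy data are hypothetical.

```python
import math
import random


def oversample_cluster(samples, ratio, k=3, seed=0):
    """Return round(ratio * len(samples)) synthetic points for one cluster.

    Each synthetic point lies on the segment between a randomly chosen
    sample and one of its k nearest neighbours in the same cluster
    (the SMOTE interpolation rule). Hypothetical helper, not the paper's code.
    """
    rng = random.Random(seed)
    n_new = round(ratio * len(samples))
    if len(samples) < 2 or n_new <= 0:
        return []
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Nearest neighbours of `base` within the cluster, excluding itself.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority cluster of (feature, dose) pairs; the values are made up.
small_cluster = [(1.0, 6.0), (1.2, 6.5), (0.9, 7.0), (1.1, 6.2)]

# A small cluster gets a large oversampling ratio (> 1.2, per the abstract).
new_points = oversample_cluster(small_cluster, ratio=1.5)
print(len(new_points))  # 6 synthetic samples for 4 originals
```

Under the guidance quoted in the abstract, a large minority cluster would instead be passed a ratio below 0.2, so that it contributes only a few synthetic samples.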

Highlights

Vitamin K antagonist (warfarin) is the most classical and widely used oral anticoagulant, with a reliable anticoagulant effect, wide clinical indications and a low price
In terms of the percentage of patients whose predicted dose of warfarin is within 20% of the actual stable therapeutic dose (20%-p), DBCSMOTE-boosted regression tree (BRT) achieved the largest value, 47.8%, among the predictive models
Including left atrium diameter, glutamic pyruvic transaminase and serum creatinine in the model improved predictive accuracy; when congestive heart failure, diabetes mellitus and valve replacement were absent from DBCSMOTE-BRT/random forest (RF), the predictive accuracy of DBCSMOTE-BRT/RF decreased
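The 20%-p measure quoted above can be computed directly from paired actual and predicted doses. The short Python sketch below shows one way to do so; the function name `within_20_percent` and the example doses are hypothetical and are not taken from the paper's data.

```python
def within_20_percent(actual, predicted):
    """Percentage of patients whose predicted dose is within 20% of the actual dose."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # A prediction counts as a hit when |predicted - actual| <= 0.2 * actual.
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= 0.2 * a)
    return 100.0 * hits / len(actual)


# Made-up daily doses (mg/day) for four patients.
actual = [3.0, 5.0, 2.5, 6.0]
predicted = [3.3, 3.5, 2.5, 7.5]

score = within_20_percent(actual, predicted)
print(score)  # 50.0: the first and third predictions fall within 20%
```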

Summary

Introduction

Vitamin K antagonist (warfarin) is the most classical and widely used oral anticoagulant, with a reliable anticoagulant effect, wide clinical indications and a low price. Although new oral anticoagulants (NOACs) are easy to use, they are relatively expensive and their indications are relatively limited. Once the individual initial dosage can be accurately predicted, the number of dosage adjustments before stabilization can be reduced, the effectiveness and safety of warfarin anticoagulation can be improved, and mortality from thromboembolism can be reduced. Individualized warfarin dose prediction has been an active research topic in individualized anticoagulant therapy in recent years. With the long-term accumulation of warfarin medical data, the volume and completeness of the data are increasing, which provides a basis for building precise individual warfarin dose prediction models with machine learning methods.

Methods

Results

Discussion

Conclusion