Abstract

Introduction: Cardiovascular disease (CVD) is the leading cause of death in the U.S. and globally. Predictive models are essential for identifying patients at high risk of CVD, and machine learning (ML) is increasingly used to build them. However, the datasets used to train ML models can contain systematic bias (e.g., sampling bias), which can in turn bias the models. This study evaluated methods for reducing bias in 10-year CVD prediction models across demographic groups (White and Black, male and female).

Methods: We used a large cohort derived from VUMC de-identified EHR data, including outpatients with follow-up visits between 2007 and 2017. Logistic regression (LR), random forest (RF), and gradient boosting tree (GBT) models were trained. Predictors included traditional CVD risk factors, laboratory results, prior diagnoses, and medications. Fairness was measured with two metrics: equal opportunity difference (EOD), the difference in true positive rate between groups (0 indicates fairness), and disparate impact (DI), the ratio of the predicted CVD rate between groups (1 indicates fairness). Three de-biasing methods were applied: removing protected attributes (i.e., race and gender) from the model, resampling the training set by sample size, and resampling the training set by the proportion of CVD.

Results: The study included 109,490 individuals (9,824 with CVD) with a mean (SD) age of 47.4 (14.7) years (White 86.3%, Black 13.7%; male 35.5%, female 64.5%). Models showed a slight bias across race groups but a large EOD and DI across genders (i.e., models had higher prediction accuracy for men than for women). After de-biasing, EOD and DI were significantly improved for genders (Fig. 1). Resampling by size reduced bias while retaining accuracy (e.g., 0.78 AUROC for GBT).

Conclusions: In the VUMC cohort, ML models had lower prediction accuracy for women despite the high proportion of women in the data. Resampling the training set reduced the bias, but it should be used with caution depending on the magnitude and nature of the bias.
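
The two fairness metrics and the resampling-by-size idea can be illustrated with a minimal Python sketch. It is not the study's implementation. The function names, the toy data, the sign convention for EOD (unprivileged minus privileged), and the choice to downsample every group to the size of the smallest group are all assumptions made for illustration; the abstract does not specify them.

```python
import random
from collections import defaultdict


def _select(y_true, y_pred, groups, g):
    """Return the labels and predictions that belong to group g."""
    idx = [i for i, x in enumerate(groups) if x == g]
    return [y_true[i] for i in idx], [y_pred[i] for i in idx]


def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_positives) / len(preds_for_positives)


def equal_opportunity_difference(y_true, y_pred, groups, unpriv, priv):
    """TPR(unpriv) - TPR(priv); 0 indicates fairness (sign convention assumed)."""
    t_u, p_u = _select(y_true, y_pred, groups, unpriv)
    t_p, p_p = _select(y_true, y_pred, groups, priv)
    return true_positive_rate(t_u, p_u) - true_positive_rate(t_p, p_p)


def disparate_impact(y_true, y_pred, groups, unpriv, priv):
    """Ratio of predicted-positive rates, unpriv over priv; 1 indicates fairness."""
    _, p_u = _select(y_true, y_pred, groups, unpriv)
    _, p_p = _select(y_true, y_pred, groups, priv)
    return (sum(p_u) / len(p_u)) / (sum(p_p) / len(p_p))


def resample_equal_group_size(rows, group_of, seed=0):
    """Downsample each group to the smallest group's size (assumed strategy)."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[group_of(row)].append(row)
    n = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, n))
    return balanced


# Toy data (hypothetical): 1 = CVD, 0 = no CVD.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["M", "M", "M", "M", "F", "F", "F", "F"]

eod = equal_opportunity_difference(y_true, y_pred, groups, "F", "M")
di = disparate_impact(y_true, y_pred, groups, "F", "M")
print(f"EOD = {eod:.2f}, DI = {di:.2f}")

# Unbalanced toy training set: six "M" rows and two "F" rows.
train = [("M", i) for i in range(6)] + [("F", i) for i in range(2)]
balanced = resample_equal_group_size(train, group_of=lambda row: row[0])
```

In the toy data the model finds every true case among men but only half of the true cases among women, so EOD is negative and DI is below 1, the same direction of bias the abstract reports across genders. Resampling by the proportion of CVD would follow the same pattern but group rows by demographic group and outcome together.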
