Evaluating Bias-Mitigated Predictive Models of Perinatal Mood and Anxiety Disorders

Emily F Wong,Anil K Saini,Eynav E Accortt,Melissa S Wong,Jason H Moore,Tiffani J Bright

doi:10.1001/jamanetworkopen.2024.38152

Abstract

Machine learning for augmented screening of perinatal mood and anxiety disorders (PMADs) requires thorough consideration of clinical biases embedded in electronic health records (EHRs) and rigorous evaluations of model performance. To mitigate bias in predictive models of PMADs trained on commonly available EHRs. This diagnostic study collected data as part of a quality improvement initiative from 2020 to 2023 at Cedars-Sinai Medical Center in Los Angeles, California. The study inclusion criteria were birthing patients aged 14 to 59 years with live birth records and admission to the postpartum unit or the maternal-fetal care unit after delivery. Patient-reported race and ethnicity (7 levels) obtained through EHRs. Logistic regression, random forest, and extreme gradient boosting models were trained to predict 2 binary outcomes: moderate to high-risk (positive) screen assessed using the 9-item Patient Health Questionnaire (PHQ-9), and the Edinburgh Postnatal Depression Scale (EPDS). Each model was fitted with or without reweighing data during preprocessing and evaluated through repeated K-fold cross validation. In every iteration, each model was evaluated on its area under the receiver operating curve (AUROC) and on 2 fairness metrics: demographic parity (DP), and difference in false negatives between races and ethnicities (relative to non-Hispanic White patients). Among 19 430 patients in this study, 1402 (7%) identified as African American or Black, 2371 (12%) as Asian American and Pacific Islander; 1842 (10%) as Hispanic White, 10 942 (56.3%) as non-Hispanic White, 606 (3%) as multiple races, 2146 (11%) as other (not further specified), and 121 (<1%) did not provide this information. The mean (SD) age was 34.1 (4.9) years, and all patients identified as female. Racial and ethnic minority patients were significantly more likely than non-Hispanic White patients to screen positive on both the PHQ-9 (odds ratio, 1.47 [95% CI, 1.23-1.77]) and the EPDS (odds ratio, 1.38 [95% CI, 1.20-1.57]). Mean AUROCs ranged from 0.610 to 0.635 without reweighing (baseline), and from 0.602 to 0.622 with reweighing. Baseline models predicted significantly greater prevalence of postpartum depression for patients who were not non-Hispanic White relative to those who were (mean DP, 0.238 [95% CI, 0.231-0.244]; P < .001) and displayed significantly lower false-negative rates (mean difference, -0.184 [95% CI, -0.195 to -0.174]; P < .001). Reweighing significantly reduced differences in DP (mean DP with reweighing, 0.022 [95% CI, 0.017-0.026]; P < .001) and false-negative rates (mean difference with reweighing, 0.018 [95% CI, 0.008-0.028]; P < .001) between racial and ethnic groups. In this diagnostic study of predictive models of postpartum depression, clinical prediction models trained to predict psychometric screening results from commonly available EHRs achieved modest performance and were less likely to widen existing health disparities in PMAD diagnosis and potentially treatment. These findings suggest that is critical for researchers and physicians to consider their model design (eg, desired target and predictor variables) and evaluate model bias to minimize health disparities.

Full Text