Assessing the effects of data drift on the performance of machine learning models used in clinical sepsis prediction

Background: Data drift can negatively impact the performance of machine learning algorithms (MLAs) that were trained on historical data. As such, MLAs should be continuously monitored and tuned to overcome the systematic changes that occur in the distribution of data. In this paper, we study the extent of data drift and provide insights about its characteristics for sepsis onset prediction. This study will help elucidate the nature of data drift for prediction of sepsis and similar diseases, and may aid the development of more effective patient monitoring systems that can stratify risk for dynamic disease states in hospitals. Methods: We devise a series of simulations that measure the effects of data drift in patients with sepsis, using electronic health records (EHR). We simulate multiple scenarios in which data drift may occur, namely a change in the distribution of the predictor variables (covariate shift), a change in the statistical relationship between the predictors and the target (concept shift), and the occurrence of a major healthcare event (major event) such as the COVID-19 pandemic. We measure the impact of data drift on model performance, identify the circumstances that necessitate model retraining, and compare the effects of different retraining methodologies and model architectures on the outcomes. We present results for two different MLAs, eXtreme Gradient Boosting (XGB) and a Recurrent Neural Network (RNN). Results: Our results show that properly retrained XGB models outperform the baseline models in all simulation scenarios, signifying the existence of data drift. In the major event scenario, the area under the receiver operating characteristic curve (AUROC) at the end of the simulation period is 0.811 for the baseline XGB model and 0.868 for the retrained XGB model. In the covariate shift scenario, the AUROC at the end of the simulation period is 0.853 for the baseline and 0.874 for the retrained XGB model. In the concept shift scenario under the mixed labeling method, the retrained XGB models perform worse than the baseline model for most simulation steps; under the full relabeling method, however, the AUROC at the end of the simulation period is 0.852 for the baseline and 0.877 for the retrained XGB model. The results for the RNN models were mixed, suggesting that retraining based on a fixed network architecture may be inadequate for an RNN. We also report other performance metrics, including the ratio of observed to expected probabilities (calibration) and the positive predictive value (PPV) normalized by prevalence, referred to as lift, at a sensitivity of 0.8. Conclusion: Our simulations reveal that retraining periods of a couple of months, or retraining on several thousand patients, are likely to be adequate for monitoring machine learning models that predict sepsis. This indicates that a machine learning system for sepsis prediction will probably need less infrastructure for performance monitoring and retraining than applications in which data drift is more frequent and continuous. Our results also show that in the event of a concept shift, a full overhaul of the sepsis prediction model may be necessary, because a concept shift indicates a discrete change in the definition of sepsis labels, and mixing old and new labels for the sake of incremental training may not produce the desired results.

Open Access
Relevant
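
The covariate-shift simulation described in the abstract above can be illustrated with a minimal sketch: train an XGBoost model on pre-shift data, drift the feature distribution over successive steps, retrain on the most recent window at each step, and compare AUROC against the frozen baseline. The data, features, and retraining schedule below are synthetic stand-ins assumed for illustration, not the study's cohort, labels, or pipeline.

```python
# Illustrative sketch (not the authors' code): simulate covariate shift on
# synthetic data and compare a frozen baseline XGBoost model against one
# retrained on a recent window, scoring both with AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

def make_cohort(n, shift=0.0):
    """Synthetic features: covariate shift moves the feature means only."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 10))
    logits = X[:, 0] + 0.5 * X[:, 1] - 1.0          # fixed concept P(y|X)
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

# Train the baseline on pre-shift (historical) data.
X_hist, y_hist = make_cohort(20_000, shift=0.0)
baseline = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
baseline.fit(X_hist, y_hist)

# Simulate drifting periods and fully retrain on the latest window each step.
for step, shift in enumerate([0.0, 0.5, 1.0, 1.5]):
    X_new, y_new = make_cohort(5_000, shift=shift)
    retrained = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    retrained.fit(X_new, y_new)

    X_test, y_test = make_cohort(5_000, shift=shift)
    auc_base = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
    auc_retr = roc_auc_score(y_test, retrained.predict_proba(X_test)[:, 1])
    print(f"step {step} shift {shift:.1f} | baseline AUROC {auc_base:.3f} | retrained AUROC {auc_retr:.3f}")
```
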
A machine learning approach for unbiased diagnosis of acute coronary syndrome in the emergency department

Abstract Importance: Despite sex and race disparities in the symptom presentation, diagnosis, and management of acute coronary syndrome (ACS), these differences have not been investigated in the development and validation of machine learning (ML) models that use individualized patient information from electronic health records (EHRs) to diagnose ACS. Objective: To evaluate ML-based ACS diagnosis performance across different subpopulations in a multi-site emergency department (ED) setting and to determine how bias-mitigating techniques influence ML performance. Design, Setting, and Participants: This retrospective observational study included data from 2,334,316 ED patients (>18 years) from January 2007 to June 2020. Exposure: Logistic regression (LR) and neural network (NN) models were assessed in ED encounters grouped by sex, race, presence or absence of chest pain, EHR data quality, and timeliness of several key ED procedures. Prejudice regularization, reweighting, and within-subpopulation training were evaluated for bias mitigation. Main Outcomes/Measures: Metrics including the area under the receiver operating characteristic curve (AUROC) were used to assess performance. Results: We analyzed 4,268,165 ED visits in which patient demographics by race were 67.40% White, 19.20% Black, 2.40% Asian, and 11.00% Other or Unknown. Patient composition was 54.80% female and 45.20% male. Both models' AUROCs were significantly higher in White vs. Black patients (z-score = 3.23 for LR and 4.26 for NN; P < 0.0006), in males vs. females (z-score = 3.81 for LR and 4.16 for NN; P < 0.0001), and in the no chest pain subpopulation vs. chest pain (z-score = 13.32 for LR and 17.70 for NN; P < 0.0001). Prejudice regularization and reweighting techniques did not reduce biases. Training in race-specific and sex-specific populations also did not yield statistically significant improvements in ML algorithm performance. Chest pain-specific training led to significantly improved AUROC. Conclusion: EHR-derived ML models trained and tested within similar demographic subpopulations and symptom groups may perform better than ML models trained in random populations and provide less biased clinical decision support for ACS diagnosis.

Relevant
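
The subgroup evaluation and reweighting described in the abstract above can be sketched as follows: compute per-group AUROC for a fitted model, then refit with Kamiran-and-Calders-style sample weights that make group and label look statistically independent. This is an assumed, generic reweighting scheme on synthetic data, shown only to illustrate the mechanics; it is not the paper's exact bias-mitigation pipeline, cohort, or model configuration.

```python
# Illustrative sketch (assumptions, not the paper's pipeline): compare AUROC
# across subgroups and apply a simple reweighting scheme before refitting
# a logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 50_000
group = rng.integers(0, 2, size=n)                  # 0/1 stand-in for a demographic split
X = rng.normal(size=(n, 8)) + group[:, None] * 0.3  # group-dependent feature distribution
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.8 * X[:, 1] - 1)))).astype(int)

def group_auroc(model, X, y, group):
    # Evaluated in-sample for brevity; a held-out split is preferable in practice.
    return {int(g): round(roc_auc_score(y[group == g], model.predict_proba(X[group == g])[:, 1]), 3)
            for g in np.unique(group)}

plain = LogisticRegression(max_iter=1000).fit(X, y)

# Reweighting: weight each (group, label) cell by P(group) * P(label) / P(group, label).
w = np.empty(n)
for g in (0, 1):
    for lbl in (0, 1):
        mask = (group == g) & (y == lbl)
        w[mask] = (mask.size * (group == g).mean() * (y == lbl).mean()) / mask.sum()

reweighted = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

print("per-group AUROC, unweighted :", group_auroc(plain, X, y, group))
print("per-group AUROC, reweighted :", group_auroc(reweighted, X, y, group))
```
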
Machine Learning to Predict Long and Short Term Fracture Risk in Postmenopausal Women

Abstract Purpose: Fractures in older adults are a significant cause of morbidity and mortality, particularly for post-menopausal women with osteoporosis. Prevention is key for managing fractures in this population and may include identifying individuals at high fracture risk and providing therapeutic treatment to mitigate risk. This study aimed to develop a machine learning fracture risk prediction tool to overcome the limitations of existing methods by incorporating additional risk factors and providing short-term risk predictions. Methods: We developed a machine learning model to predict the risk of major osteoporotic fractures and femur (hip) fractures in a retrospective cohort of post-menopausal women. Models were trained to generate predictions at 3-, 5-, and 10-year prediction windows. The model used only ICD codes, basic demographics, vital sign measurements, lab results, and medication usage from a proprietary national longitudinal electronic health record repository to make predictions. Results: The algorithms obtained area under the receiver operating characteristic curve (AUROC) values of 0.83, 0.81, and 0.79 for prediction of major osteoporotic fractures at 3-, 5-, and 10-year windows, respectively, and AUROC values of 0.79, 0.75, and 0.75 for prediction of femur fractures at 3-, 5-, and 10-year windows, respectively. For all models, when sensitivity was fixed at 0.80, average specificity was 0.615. Conclusion: Machine learning clinical decision support may inform clinical efforts at early detection of high-risk individuals, mitigation of their risk, and establishment of clinical research cohorts with well-defined patient populations.

Relevant
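
The "specificity at a fixed sensitivity of 0.80" operating point reported in the abstract above is read off the ROC curve: find the first threshold whose true positive rate reaches 0.80 and report one minus the false positive rate there. The sketch below uses made-up risk scores, not the study's model or data.

```python
# Illustrative sketch: specificity at a fixed sensitivity from an ROC curve.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=10_000)
scores = y_true * 0.8 + rng.normal(size=y_true.size)   # mildly informative fake risk scores

fpr, tpr, thresholds = roc_curve(y_true, scores)

target_sensitivity = 0.80
idx = np.argmax(tpr >= target_sensitivity)              # first threshold reaching the target
print(f"AUROC               : {roc_auc_score(y_true, scores):.3f}")
print(f"threshold           : {thresholds[idx]:.3f}")
print(f"sensitivity (TPR)   : {tpr[idx]:.3f}")
print(f"specificity (1-FPR) : {1 - fpr[idx]:.3f}")
```
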
Risk assessment of acute respiratory failure requiring advanced respiratory support using machine learning

Abstract Background: Acute respiratory failure (ARF) presents within a spectrum of clinical manifestations and illness severity, and mortality occurs in approximately 30% of patients who develop ARF. Early risk identification is imperative for implementation of prophylactic measures prior to ARF onset. In this study, we develop and validate a machine learning algorithm (MLA) to predict patients at risk of ARF requiring advanced respiratory support. Methods: This retrospective study used data from 155,725 patient electronic health records obtained from five United States community hospitals. An XGBoost classifier was developed using patient EHR data to produce risk scores at 3-hour intervals to predict the risk of ARF within 24 hours. An alert was generated only once prior to ARF onset, defined by the implementation of advanced respiratory support, for patients whose risk score exceeded a predefined threshold. We used a novel time-sensitive area under the receiver operating characteristic (tAUROC) curve that integrates the timing of the alert relative to ARF onset to evaluate the accuracy of the MLA. The MLA was assessed on two testing sets and compared with oxygen saturation (SpO2) measurement and the modified early warning score (MEWS). Results: The MLA demonstrated significantly higher sensitivity and specificity operating points on the temporal testing and external validation sets (tAUROC of 0.858 and 0.883, respectively) than SpO2 (0.771 and 0.810) and MEWS (0.676 and 0.774) for prediction of ARF requiring advanced respiratory support. The MLA also achieved lower false positive rates than SpO2 and MEWS at these operating points. Conclusions: The MLA predicts patients at risk of ARF requiring advanced respiratory support, achieves higher accuracy, and produces earlier alerts than SpO2 or MEWS. Importantly for clinical practice, the MLA has a lower false positive rate than these comparators while maintaining high sensitivity and specificity.

Relevant
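
The single-alert policy described in the abstract above (score every 3 hours, fire one alert the first time the 24-hour ARF risk crosses a threshold) can be sketched as a small post-processing step over a model's scores. The data structures, threshold, and example scores below are assumptions for illustration, not the study's implementation.

```python
# Illustrative sketch (assumed workflow, not the study's code): given risk
# scores produced every 3 hours by a trained classifier, fire a single alert
# at the first window whose score exceeds a predefined threshold.
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class ScoredWindow:
    hours_from_admit: int
    risk: float

def first_alert(windows: Sequence[ScoredWindow], threshold: float = 0.5) -> Optional[ScoredWindow]:
    """Return the first 3-hourly window whose risk meets the threshold, else None."""
    for w in windows:                      # windows assumed ordered in time
        if w.risk >= threshold:
            return w                       # alert only once, mirroring the single-alert policy
    return None

# Example encounter with made-up risk scores (hypothetical values).
encounter = [ScoredWindow(h, r) for h, r in
             [(0, 0.12), (3, 0.18), (6, 0.31), (9, 0.47), (12, 0.63), (15, 0.71)]]
alert = first_alert(encounter, threshold=0.6)
if alert is not None:
    print(f"alert at hour {alert.hours_from_admit} with risk {alert.risk:.2f}")
```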