Risk of bias in machine learning and statistical models to predict height or weight: a systematic review in fetal and paediatric medicine.
Prediction of suboptimal growth allows early intervention that can improve outcomes for developing fetuses as well as infants and children. We investigated the risk of bias in statistical and machine learning models that predict the height or weight of a fetus, infant, or child under 20 years of age, to gauge the current standard of research and to provide insight into why equations developed over 30 years ago are still recommended for use by national professional bodies. We systematically searched MEDLINE and EMBASE for peer-reviewed original research studies published in 2022. We included studies that developed or validated a multivariable model to predict the height or weight of an individual using two or more variables, excluding studies assessing imaging or using genetic or metabolomic information. Risk of bias was assessed for all prediction models and analyses using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). Sixty-four studies were included, in which we assessed the development of 180 models and the validation of 61 models. Sample size was considered in only 10% of developed models and 13% of validated models. Although height and weight are continuous variables, 77% of the developed models predicted a dichotomised outcome. The review was registered on PROSPERO, the International Prospective Register of Systematic Reviews, on 26/4/2023 (ID: CRD42023421146).
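To make the cost of dichotomisation concrete, here is a minimal simulation (illustrative only, not taken from the review): the correlation between a predictor and a continuous outcome is attenuated once the outcome is split at a cut-point, so a model trained on the binary label starts from strictly less information than one trained on the continuous value.

```python
# Illustrative simulation (not from the review): dichotomising a continuous
# outcome such as a weight z-score attenuates its association with a predictor.
# Cohen (1983) reports roughly 20% loss in correlation even at a median split;
# extreme cut-points such as the 10th centile lose more.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                    # a single hypothetical predictor
weight = 0.6 * x + rng.normal(size=n)     # continuous outcome (e.g. weight z-score)

# Dichotomised outcome, e.g. "low weight" below the 10th centile.
low_weight = (weight < np.quantile(weight, 0.10)).astype(float)

r_continuous = np.corrcoef(x, weight)[0, 1]
r_binary = np.corrcoef(x, low_weight)[0, 1]
print(f"|r| with continuous outcome:   {abs(r_continuous):.3f}")
print(f"|r| with dichotomised outcome: {abs(r_binary):.3f}")
```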
- Research Article
- 10.1016/j.jclinepi.2025.111732
- May 1, 2025
- Journal of clinical epidemiology
Since 2019, the Prediction model Risk Of Bias ASsessment Tool (PROBAST; www.probast.org) has supported methodological quality assessments of prediction model studies. Most prediction model studies are rated at "High" risk of bias (ROB), and researchers report low interrater reliability (IRR) when using PROBAST. We aimed to (1) assess the IRR of PROBAST ratings between assessors of the same study and understand reasons for discrepancies, (2) determine which items contribute most to domain-level ROB ratings, and (3) explore the impact of consensus meetings. We used PROBAST assessments from a systematic review of diagnostic and prognostic COVID-19 prediction models as a case study. Assessors included international experts in prediction model studies or their reviews. We assessed IRR using the prevalence-adjusted bias-adjusted kappa (PABAK) before consensus meetings, examined which bias ratings contributed to domain-level ROB judgments, and evaluated the impact of consensus meetings by identifying rating changes after discussion. We analyzed 2167 PROBAST assessments from 27 assessor pairs covering 760 prediction models: 384 developments, 242 validations, and 134 mixed assessments (including both). The IRR using PABAK was higher for overall ROB judgments (development: 0.82 [0.76; 0.89]; validation: 0.78 [0.68; 0.88]) than for domain- and item-level judgments. Some PROBAST items frequently contributed to domain-level ROB judgments, e.g., 3.5 Outcome blinding and 4.1 Sample size. Consensus discussions mainly led to item-level changes and never to overall ROB rating changes. Within this case study, PROBAST assessments showed high IRR at the overall ROB level, with some variation at the item and domain levels. To reduce variability, PROBAST assessors should standardize item- and domain-level judgments and hold well-structured consensus meetings between assessors of the same study.

The Prediction model Risk Of Bias ASsessment Tool (PROBAST; www.probast.org) provides a set of items to assess the quality of medical studies on so-called prediction tools, which calculate an individual's probability of having or developing a certain disease or health outcome. Previous research found low interrater reliability (IRR; i.e., how consistently two assessors rate aspects of the same study) when using PROBAST. To understand why, we conducted a large study involving more than 30 experts from around the world, all of whom applied PROBAST to the same set of prediction tool studies. Based on more than 2150 PROBAST assessments, we identified which PROBAST items led to the most disagreements between raters, explored reasons for these disagreements, and examined whether so-called consensus meetings (i.e., different assessors of the same study discuss their ratings and decide on a finalized rating) impacted PROBAST ratings. We found that the IRR between different assessors of the same study was higher than previously reported. One explanation for the better agreement may be preplanning how to assess certain PROBAST aspects before starting, as well as holding well-structured consensus meetings. Such improvements can lead to more effective use of PROBAST in evaluating the trustworthiness and quality of prediction tools in the health-care domain.
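For reference, PABAK corrects the usual kappa for prevalence and bias effects by assuming uniform marginals, reducing to PABAK = (k·P_o − 1)/(k − 1) for k rating categories and observed agreement P_o. A minimal sketch with hypothetical ratings (not the study's data or code):

```python
# Prevalence-adjusted bias-adjusted kappa (PABAK) for two assessors:
# PABAK = (k * P_o - 1) / (k - 1), where P_o is the observed proportion of
# agreement and k the number of rating categories (k = 2 for low/high ROB).

def pabak(ratings_a, ratings_b, k):
    """PABAK between two assessors; k = number of rating categories."""
    assert len(ratings_a) == len(ratings_b) > 0
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
    return (k * p_o - 1) / (k - 1)

# Hypothetical overall ROB judgments ("L" = low, "H" = high) from two assessors.
rater1 = ["H", "H", "L", "H", "L", "H", "H", "L", "H", "H"]
rater2 = ["H", "H", "L", "L", "L", "H", "H", "L", "H", "H"]
print(f"PABAK = {pabak(rater1, rater2, k=2):.2f}")  # 9/10 agreement -> 0.80
```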
- Research Article
- 10.1016/j.chiabu.2025.107630
- Nov 1, 2025
- Child abuse & neglect
Understanding the development, performance, fairness, and transparency of machine learning models used in child protection prediction: A systematic review.
- Research Article
- 10.21037/atm-22-5986
- Dec 1, 2022
- Annals of Translational Medicine
In the era of precision therapy, early classification of breast cancer (BRCA) molecular subtypes has clinical significance for disease management and prognosis. We explored the accuracy of machine learning (ML) models for early classification of BRCA molecular subtypes through a systematic review of the currently available literature. We retrieved relevant studies published in PubMed, EMBASE, Cochrane, and Web of Science until 15 April 2022. The Prediction model Risk Of Bias Assessment Tool (PROBAST) was applied to assess the risk of bias of the genomics-based ML models, and the Radiomics Quality Score (RQS) was used to evaluate the quality of the radiomics-based ML models. A random-effects model was adopted to analyze the predictive accuracy of genomics-based and radiomics-based ML for Luminal A, Luminal B, Basal-like or triple-negative breast cancer (TNBC), and human epidermal growth factor receptor 2 (HER2) subtypes. The study was prospectively registered on PROSPERO (CRD42022333611). Thirty-eight studies were selected for analysis: 14 ML models were based on gene transcriptomics, with only 4 external validations, and 43 ML models were based on radiomics, with only 14 external validations. Meta-analysis results showed that the c-statistic values of radiomics-based ML for identifying the BRCA molecular subtypes Luminal A, Luminal B, Basal-like or TNBC, and HER2 were 0.76 [95% confidence interval (CI): 0.60-0.96], 0.78 (95% CI: 0.69-0.87), 0.89 (95% CI: 0.83-0.91), and 0.83 (95% CI: 0.81-0.86), respectively. The c-statistic values of ML based on the gene-transcriptomic analysis cohort for identifying the same BRCA molecular subtypes were 0.96 (95% CI: 0.93-0.99), 0.96 (95% CI: 0.93-0.99), 0.98 (95% CI: 0.95-1.00), and 0.97 (95% CI: 0.96-0.98), respectively. Additionally, the sensitivity of the radiomics-based ML models for each molecular subtype ranged from 0.79 to 0.85, while the sensitivity of the gene-transcriptomic-based ML models was between 0.92 and 0.99. Both radiomics and gene transcriptomics predicted BRCA molecular subtypes well. Compared with radiomics, gene transcriptomics yielded better prediction results, but radiomics was simpler and more convenient from a clinical point of view.
- Research Article
- 10.2196/26634
- Mar 16, 2022
- Journal of Medical Internet Research
Background: Gestational diabetes mellitus (GDM) is a common endocrine metabolic disease, involving a carbohydrate intolerance of variable severity during pregnancy. The incidence of GDM-related complications and adverse pregnancy outcomes has declined, in part, due to early screening. Machine learning (ML) models are increasingly used to identify risk factors and enable the early prediction of GDM. Objective: The aim of this study was to perform a meta-analysis and comparison of published prognostic models for predicting the risk of GDM and identify predictors applicable to the models. Methods: Four reliable electronic databases were searched for studies that developed ML prediction models for GDM in the general population instead of among high-risk groups only. The novel Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to assess the risk of bias of the ML models. The Meta-DiSc software program (version 1.4) was used to perform the meta-analysis and determination of heterogeneity. To limit the influence of heterogeneity, we also performed sensitivity analyses, a meta-regression, and subgroup analysis. Results: A total of 25 studies that included women older than 18 years without a history of vital disease were analyzed. The pooled area under the receiver operating characteristic curve (AUROC) for ML models predicting GDM was 0.8492; the pooled sensitivity was 0.69 (95% CI 0.68-0.69; P<.001; I2=99.6%) and the pooled specificity was 0.75 (95% CI 0.75-0.75; P<.001; I2=100%). As one of the most commonly employed ML methods, logistic regression achieved an overall pooled AUROC of 0.8151, while non-logistic regression models performed better, with an overall pooled AUROC of 0.8891. Additionally, maternal age, family history of diabetes, BMI, and fasting blood glucose were the four most commonly used features of models established by the various feature selection methods. Conclusions: Compared to current screening strategies, ML methods are attractive for predicting GDM. To expand their use, the importance of quality assessments and unified diagnostic criteria should be further emphasized.
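As background on the pooling step, tools such as Meta-DiSc combine per-study estimates under a random-effects model. A rough sketch with invented counts (not the review's data), pooling sensitivities on the logit scale with a DerSimonian-Laird estimate of between-study variance:

```python
# Random-effects pooling of sensitivities on the logit scale
# (DerSimonian-Laird), the approach behind tools such as Meta-DiSc.
# tp = true positives and n = diseased participants per study (hypothetical).
import numpy as np

tp = np.array([80, 55, 120, 40])
n = np.array([110, 90, 160, 60])

p = (tp + 0.5) / (n + 1)                  # continuity-corrected sensitivity
y = np.log(p / (1 - p))                   # logit transform
v = 1 / (tp + 0.5) + 1 / (n - tp + 0.5)   # approximate within-study variance

w = 1 / v                                 # fixed-effect weights
y_fe = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - y_fe) ** 2)           # Cochran's Q
tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_re = 1 / (v + tau2)                     # random-effects weights
y_re = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
pooled = 1 / (1 + np.exp(-y_re))          # back-transform to a proportion
lo, hi = (1 / (1 + np.exp(-(y_re + s * 1.96 * se))) for s in (-1, 1))
print(f"pooled sensitivity = {pooled:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```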
- Research Article
- 10.3389/fcvm.2022.812276
- Apr 6, 2022
- Frontiers in Cardiovascular Medicine
Objective: To compare the performance, clinical feasibility, and reliability of statistical and machine learning (ML) models in predicting heart failure (HF) events. Background: Although ML models have been proposed to revolutionize medicine, their promise in predicting HF events has not been investigated in detail. Methods: A systematic search was performed on Medline, Web of Science, and IEEE Xplore for studies published between January 1, 2011, and July 14, 2021, that developed or validated at least one statistical or ML model that could predict all-cause mortality or all-cause readmission of HF patients. The Prediction Model Risk of Bias Assessment Tool was used to assess the risk of bias, and a random-effects model was used to evaluate the pooled c-statistics of the included models. Results: Two hundred and two statistical model studies and 78 ML model studies were included from the retrieved papers. The pooled c-indices of statistical models in predicting all-cause mortality, ML models in predicting all-cause mortality, statistical models in predicting all-cause readmission, and ML models in predicting all-cause readmission were 0.733 (95% confidence interval 0.724-0.742), 0.777 (0.752-0.803), 0.678 (0.651-0.706), and 0.660 (0.633-0.686), respectively, indicating that ML models did not show consistent superiority over statistical models. Head-to-head comparisons revealed similar results. Meanwhile, the immoderate use of predictors limited the feasibility of ML models. The risk-of-bias analysis indicated that the technical pitfalls of ML models were more serious than those of statistical models. Furthermore, the efficacy of ML models among different HF subgroups is still unclear. Conclusions: ML models did not achieve a significant advantage in predicting events, and their clinical feasibility and reliability were worse.
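For context, the c-index pooled above is the concordance probability; for a binary endpoint it equals the AUROC. A minimal, self-contained sketch on toy data (not from the review) computing it directly over case/non-case pairs:

```python
# The c-statistic as a concordance probability: the chance that a randomly
# chosen event receives a higher predicted risk than a randomly chosen
# non-event; ties in predicted risk count as half a win.
import itertools

def c_statistic(risk, event):
    """Concordance probability for predicted risks and binary outcomes."""
    pairs = [(r1, r0) for (r1, e1), (r0, e0) in
             itertools.product(zip(risk, event), repeat=2)
             if e1 == 1 and e0 == 0]
    wins = sum(1.0 if r1 > r0 else 0.5 if r1 == r0 else 0.0 for r1, r0 in pairs)
    return wins / len(pairs)

# Toy predicted risks of HF readmission and observed outcomes.
risk = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
event = [1, 1, 0, 1, 0, 0]
print(f"c-statistic = {c_statistic(risk, event):.3f}")  # 8/9 concordant pairs
```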
- Research Article
- 10.3389/fonc.2025.1555247
- Apr 14, 2025
- Frontiers in oncology
This study aimed to evaluate the quality and transparency of reporting in studies using machine learning (ML) in oncology, focusing on adherence to the Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Models (CREMLS), TRIPOD+AI (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis), and PROBAST (Prediction Model Risk of Bias Assessment Tool). The literature search included primary studies published between February 1, 2024, and January 31, 2025, that developed or tested ML models for cancer diagnosis, treatment, or prognosis. To reflect the current state of the rapidly evolving landscape of ML applications in oncology, the fifteen most recent articles in each category were selected for evaluation. Two independent reviewers screened studies and extracted data on study characteristics, reporting quality (CREMLS and TRIPOD+AI), risk of bias (PROBAST), and ML performance metrics. The most frequently studied cancer types were breast cancer (n=7/45; 15.6%), lung cancer (n=7/45; 15.6%), and liver cancer (n=5/45; 11.1%). The findings indicate several deficiencies in reporting quality, as assessed by CREMLS and TRIPOD+AI. These deficiencies primarily relate to sample size calculation, reporting on data quality, strategies for handling outliers, documentation of ML model predictors, access to training or validation data, and reporting on model performance heterogeneity. The methodological quality assessment using PROBAST revealed that 89% of the included studies exhibited a low overall risk of bias, and all studies showed a low risk of bias in terms of applicability. Regarding the specific AI models identified as the best-performing, Random Forest (RF) and XGBoost were the most frequently reported, each used in 17.8% of the studies (n = 8). Additionally, our study outlines the specific areas where reporting is deficient, providing researchers with guidance to improve reporting quality in these sections and, consequently, reduce the risk of bias in their studies.
- Research Article
- 10.1016/j.jclinepi.2021.06.017
- Jun 24, 2021
- Journal of Clinical Epidemiology
Large-scale validation of the prediction model risk of bias assessment tool (PROBAST) using a short form: high risk of bias models show poorer discrimination
- Research Article
- 10.21037/tcr-23-859
- Sep 1, 2023
- Translational Cancer Research
Radiotherapy is a common treatment for nasopharyngeal carcinoma (NPC) but can cause radiation-induced temporal lobe injury (RTLI), resulting in irreversible damage. Predicting RTLI at an early stage may mitigate this by allowing personalized adjustment of the radiation dose based on the predicted risk. Machine learning (ML) models have recently been used to predict RTLI, but their predictive accuracy remains unclear because reported concordance index (C-index) values vary widely, from around 0.31 to 0.97; a meta-analysis was therefore needed. The PubMed, Web of Science, Embase, and Cochrane Library databases were searched from inception to November 2022. Studies that fully developed one or more ML risk models of RTLI after radiotherapy for NPC were included. The Prediction model Risk Of Bias Assessment Tool (PROBAST) was used to assess the risk of bias of the included research. The primary outcomes of this review were the C-index, specificity (Spe), and sensitivity (Sen). The meta-analysis included 14 studies with 15,573 NPC patients, reporting a total of 72 prediction models. Overall, 94.44% of models were found to have a high risk of bias. Radiomics was included in 57 models, dosimetric predictors in 28, and clinical data in 27. The pooled C-index for ML models predicting RTLI was 0.77 [95% confidence interval (CI): 0.75-0.79] in the training set and 0.78 (95% CI: 0.75-0.81) in the validation set. The pooled Sen was 0.75 (95% CI: 0.69-0.80) in the training set and 0.70 (95% CI: 0.66-0.73) in the validation set, and the pooled Spe was 0.78 (95% CI: 0.73-0.82) in the training set and 0.79 (95% CI: 0.75-0.82) in the validation set. Models combining radiomics and clinical data achieved the best discriminative performance, with a pooled C-index of 0.895. ML models can accurately predict RTLI at an early stage, allowing timely interventions to prevent further damage. The choice of ML method and of predictors may influence predictive accuracy.
- Research Article
- 10.1017/s0266462322001799
- Dec 1, 2022
- International Journal of Technology Assessment in Health Care
Introduction: Risk prediction models, using either machine learning or statistical algorithms, can act as inputs to a cost-effectiveness model when predicting the costs and effectiveness of an intervention. This systematic review has two objectives: to evaluate the methodological quality of published models predicting diabetic coronary heart disease (CHD) risk, and to evaluate whether the models were reported in sufficient detail to judge their applicability to cost-effectiveness modelling. Methods: A targeted review of journal articles published in English, Dutch, Chinese, or Spanish was undertaken in PubMed, Embase, Scopus, Web of Science, and IEEE Xplore from 1 January 2016 to 31 May 2021. To assess the methodological quality and reporting of the models, we used PROBAST (Prediction model Risk Of Bias Assessment Tool), CHARMS (a Checklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies), and a checklist (Betts 2019) summarizing the application of cardiovascular risk prediction models to health technology assessment. Results: Our search retrieved 6,579 hits, of which 18 models were eligible for inclusion. Among them, four studies developed machine learning models (2 recurrent neural networks, 1 random forest model, and 1 multi-task learning model) while 14 studies developed statistical models (8 Cox models, 5 logistic models, and 1 microsimulation model). More than 70 percent of models were of high methodological quality in the aspects of participants (89%), predictors (72%), and outcomes (72%), but only five models (28%) in the aspect of statistical analysis. For reporting, only two models provided sufficient evidence in all aspects (i.e., participants, predictors, and outcomes) for judging their applicability to cost-effectiveness modelling. Most models were reported sufficiently regarding participants (78%) and outcomes (72%), but only three models regarding predictors (17%). Conclusions: To apply CHD risk prediction models to cost-effectiveness modelling, concerns remain about the potential risk of bias due to inappropriate use of analysis methods and about insufficient reporting on how the predictors were measured and assessed.
- Research Article
- 10.3389/fendo.2025.1495306
- Mar 3, 2025
- Frontiers in endocrinology
Machine learning (ML) models are being increasingly employed to predict the risk of developing and progressing diabetic kidney disease (DKD) in patients with type 2 diabetes mellitus (T2DM). However, the performance of these models still varies, which limits their widespread adoption and practical application. We therefore conducted a systematic review and meta-analysis to summarize and evaluate the performance and clinical applicability of these risk prediction models and to identify key research gaps. We searched PubMed, Embase, the Cochrane Library, and Web of Science for English-language studies using ML algorithms to predict the risk of DKD in patients with T2DM, covering the period from database inception to April 18, 2024. The primary performance metric for the models was the area under the receiver operating characteristic curve (AUC) with a 95% confidence interval (CI). The risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST) checklist. Twenty-six studies met the eligibility criteria and were included in the meta-analysis. Internal validation was performed in 25 studies, but only 8 studies conducted external validation. A total of 94 ML models were developed, with 81 models evaluated in the internal validation sets and 13 in the external validation sets. The pooled AUC was 0.839 (95% CI 0.787-0.890) in the internal validation sets and 0.830 (95% CI 0.784-0.877) in the external validation sets. Subgroup analysis based on the type of ML showed that the pooled AUC for traditional regression models was 0.797 (95% CI 0.777-0.816), for other ML models 0.811 (95% CI 0.785-0.836), and for deep learning 0.863 (95% CI 0.825-0.900). A total of 26 ML model types were included, and the AUCs of those used three or more times were pooled. Among them, the random forest (RF) models demonstrated the best performance, with a pooled AUC of 0.848 (95% CI 0.785-0.911). This meta-analysis demonstrates that ML models exhibit high performance in predicting DKD risk in T2DM patients. However, challenges related to data bias during model development and validation still need to be addressed. Future research should focus on enhancing data transparency and standardization, as well as validating these models' generalizability through multicenter studies. https://inplasy.com/inplasy-2024-9-0038/, identifier INPLASY202490038.
- Research Article
- 10.1371/journal.pone.0307531
- Jul 24, 2024
- PloS one
This systematic review aimed to evaluate the performance of machine learning (ML) models in predicting post-treatment survival and disease progression outcomes, including recurrence and metastasis, in head and neck cancer (HNC) using clinicopathological structured data. A systematic search was conducted across the Medline, Scopus, Embase, Web of Science, and Google Scholar databases. The methodological characteristics and performance metrics of studies that developed and validated ML models were assessed. The risk of bias was evaluated using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). Out of 5,560 unique records, 34 articles were included. For survival outcomes, the ML models outperformed the Cox proportional hazards model in time-to-event analyses for HNC, with a concordance index of 0.70-0.79 vs. 0.66-0.76, and for all sub-sites including the oral cavity (0.73-0.89 vs. 0.69-0.77) and larynx (0.71-0.85 vs. 0.57-0.74). In binary classification analyses, the area under the receiver operating characteristic curve (AUROC) of ML models ranged from 0.75 to 0.97, with an F1-score of 0.65-0.89 for HNC; an AUROC of 0.61-0.91 and F1-score of 0.58-0.86 for the oral cavity; and an AUROC of 0.76-0.97 and F1-score of 0.63-0.92 for the larynx. Disease-specific survival outcomes showed higher performance than overall survival outcomes, but the performance of ML models did not differ between three- and five-year follow-up durations. For disease progression outcomes, no time-to-event metrics were reported for ML models. For binary classification of the oral cavity, the only evaluated subsite, the AUROC ranged from 0.67 to 0.97, with F1-scores between 0.53 and 0.89. ML models have demonstrated considerable potential in predicting post-treatment survival and disease progression, consistently outperforming traditional linear models and their derived nomograms. Future research should incorporate more comprehensive treatment features, emphasize disease progression outcomes, and establish model generalizability through external validations and the use of multicenter datasets.
- Research Article
- 10.1200/jco.2021.39.15_suppl.e13559
- May 20, 2021
- Journal of Clinical Oncology
e13559 Background: Short-term cancer mortality prediction has many implications for care planning. An accurate prognosis allows healthcare providers to adjust care plans and take appropriate actions, such as initiating end-of-life conversations. Machine learning (ML) techniques have demonstrated promising capability to support clinical decision-making by providing reliable predictions for a variety of clinical outcomes, including cancer mortality. However, the evidence has not yet been systematically synthesized and evaluated. The objective of this review was to examine the performance and risk of bias of ML models trained to predict short-term (≤ 12 months) cancer mortality. Methods: We identified relevant literature from five electronic databases: Ovid Medline, Ovid EMBASE, Scopus, Web of Science, and IEEE Xplore. We searched each database with predefined MeSH terms and keywords of oncology, machine learning, and mortality using AND/OR statements. Inclusion criteria were: 1) developed/validated ML models for predicting oncology patient mortality within one year using electronic health record data; 2) reported model performance on a dataset that was not used to train the models; 3) original research; 4) peer-reviewed full paper in English; 5) published before 1/10/2020. We conducted risk of bias assessment using the prediction model risk of bias assessment tool (PROBAST). Results: Ten articles were included in this review. Most studies focused on predicting 1-year mortality (n = 6) for multiple types of cancer (n = 5). Most studies (n = 7) used a single metric, the area under the receiver operating characteristic curve (AUROC), to examine their models. The AUROC ranged from .69 to .91, with a median of .85. Information on samples (n = 10), resampling methods (n = 6), model tuning approaches (n = 9), censoring (n = 10), and sample size determinations (n = 10) was incomplete or absent. Six studies had a high risk of bias in the PROBAST analysis domain. Conclusions: The performance of ML models for short-term cancer mortality appears promising. However, most studies report only a single performance metric, which obscures evaluation of a model's true performance. This is especially problematic when predicting rare events such as short-term mortality. We found little-to-no information on a given model's ability to correctly identify patients at high risk of mortality. The incomplete reporting of model development poses challenges to risk of bias assessment and reduces confidence in the results. Our findings suggest that future studies should report comprehensive performance metrics using a standard reporting guideline, such as the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD), to ensure sufficient information for replication, justification, and adoption.
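The point that a single AUROC can obscure performance on rare events can be made concrete with a small simulation (simulated data; assumes scikit-learn is available): with a ~5% event rate, a model with a respectable AUROC may still show a modest average precision, i.e., limited ability to flag the patients who actually die within the year.

```python
# Simulated illustration: for a rare outcome, AUROC can look strong while the
# precision-recall view (average precision) reveals limited ability to
# identify the small group of high-risk patients.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n = 20_000
y = rng.binomial(1, 0.05, size=n)             # ~5% one-year mortality
score = rng.normal(loc=y * 1.2, scale=1.0)    # a moderately informative model

print(f"AUROC             = {roc_auc_score(y, score):.3f}")
print(f"average precision = {average_precision_score(y, score):.3f}")
print(f"event prevalence  = {y.mean():.3f}  (baseline for average precision)")
```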
- Research Article
- 10.1186/s41512-022-00119-9
- Mar 24, 2022
- Diagnostic and Prognostic Research
Background: With rising cost pressures on health care systems, machine-learning (ML)-based algorithms are increasingly used to predict health care costs. Despite their potential advantages, the successful implementation of these methods could be undermined by biases introduced in the design, conduct, or analysis of studies seeking to develop and/or validate ML models. The utility of such models may also be negatively affected by poor reporting of these studies. In this systematic review, we aim to evaluate the reporting quality, methodological characteristics, and risk of bias of ML-based prediction models for individual-level health care spending. Methods: We will systematically search PubMed and Embase to identify studies developing, updating, or validating ML-based models to predict an individual's health care spending for any medical condition, over any time period, and in any setting. We will exclude prediction models of aggregate-level health care spending, models used to infer causality, models using radiomics or speech parameters, models of non-clinically validated predictors (e.g., genomics), and cost-effectiveness analyses without predicting individual-level health care spending. We will extract data based on the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS), previously published research, and relevant recommendations. We will assess the adherence of ML-based studies to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement and examine the inclusion of transparency and reproducibility indicators (e.g., statements on data sharing). To assess the risk of bias, we will apply the Prediction model Risk Of Bias Assessment Tool (PROBAST). Findings will be stratified by study design, ML methods used, population characteristics, and medical field. Discussion: Our systematic review will appraise the quality, reporting, and risk of bias of ML-based models for individualized health care cost prediction. This review will provide an overview of the available models and give insights into the strengths and limitations of using ML methods for the prediction of health spending.
- Preprint Article
- 10.2196/preprints.65708
- Aug 23, 2024
BACKGROUND: The quality of a machine learning model relies considerably on the size of the dataset, but the development and widespread application of such models have often been hindered by confidentiality issues, particularly regarding data privacy. Predicting mortality is essential in clinical environments: when a patient is admitted, estimating their likelihood of dying by the end of their intensive care unit (ICU) stay or within a designated time frame is a way to assess the severity of their condition, and this information is crucial for treatment planning and resource allocation. However, individual hospitals typically have only a limited amount of local data with which to create a reliable model. The rise of federated learning as a privacy-preserving technology offers the potential to create models collaboratively and in a decentralized manner, eliminating the need to consolidate all datasets in a single location. Nonetheless, there is a scarcity of clear and comprehensive evidence comparing the performance of federated learning with that of traditional centralized machine learning approaches, particularly in healthcare implementations. OBJECTIVE: This study aims to review comparisons of performance between federated learning (FL)-based and centralized machine learning (CML) models for mortality prediction in clinical settings. METHODS: An electronic database search was conducted for English-language articles that developed federated learning-based models to predict mortality. Screening, data extraction, and risk of bias assessments were carried out by at least two independent reviewers. Meta-analyses of pooled area under the receiver operating characteristic curve (AUROC/AUC) values were examined for FL, CML, and local machine learning (LML) models. The risk of bias was assessed using the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) and the Prediction model Risk Of Bias Assessment Tool (PROBAST) guidelines. RESULTS: In total, 9 articles that were heterogeneous in framework design, scenario, and clinical context were included (n = 5 [55.6%] in disease-specific cases; n = 3 [33.0%] in ICU settings; and n = 2 [22.0%] in emergency department, urgent-care, or trauma-center settings). All included studies used cohort datasets. These studies universally indicated that FL models outperform LML models and come closest to CML models. The pooled AUCs for FL and CML (or LML) performance were 0.81 (95% CI 0.76-0.85, I2 = 78.36%) and 0.82 (95% CI 0.77-0.86, I2 = 72.33%), respectively. The included studies ranged from low to high or unclear risk of bias. CONCLUSIONS: This systematic review and meta-analysis demonstrates that federated learning models outperform local machine learning approaches and are comparable to centralized models. However, efficiency may be compromised by complexity, privacy preservation, and high computation and communication costs. CLINICALTRIAL: PROSPERO International Prospective Register of Systematic Reviews CRD42024539245; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=539245
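To make the FL-versus-CML comparison concrete, below is a minimal FedAvg-style sketch on toy data (one local gradient step per round, i.e., essentially federated SGD; the hospital sizes and data are invented, and this is not any included study's implementation): each site computes an update on its own patients and a server averages the parameters weighted by site size, so raw records never leave the hospitals.

```python
# Minimal FedAvg-style federated logistic regression on toy data: each
# "hospital" takes one local gradient step and a central server averages the
# resulting weights (weighted by site size); no raw patient data are shared.
import numpy as np

rng = np.random.default_rng(7)

def local_gradient_step(w, X, y, lr=0.5):
    """One logistic-regression gradient step on a site's local data."""
    p = 1 / (1 + np.exp(-X @ w))
    return w - lr * X.T @ (p - y) / len(y)

# Three hospitals with different sizes and slightly shifted populations.
w_true = np.array([1.0, -2.0, 0.5])
sites = []
for n in (200, 500, 120):
    X = rng.normal(size=(n, 3)) + rng.normal(scale=0.2, size=3)
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))
    sites.append((X, y))

w_global = np.zeros(3)
for _ in range(100):                            # communication rounds
    local_ws = [local_gradient_step(w_global, X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    w_global = np.average(local_ws, axis=0, weights=sizes)  # FedAvg step

print("federated estimate:", np.round(w_global, 2), "vs true:", w_true)
```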
- Supplementary Content
- 10.1186/s44263-025-00184-4
- Jul 24, 2025
- BMC Global and Public Health
Background: HIV treatment interruption remains a significant barrier to achieving global HIV/AIDS control goals. Machine learning (ML) models offer potential for predicting treatment interruption by leveraging large clinical datasets. Understanding how these models were developed, validated, and applied remains essential for advancing research. Methods: We searched databases including PubMed, BMC, Cochrane Library, Scopus, ScienceDirect, Lancet, and Google Scholar for studies published in English from 1990 to September 2024. Search terms covered HIV, machine learning, treatment interruption, and loss to follow-up. Articles were screened and reviewed independently, and data were extracted using the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) tool. Risk of bias was assessed with the Prediction model Risk Of Bias Assessment Tool (PROBAST). The Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines were followed throughout. Results: Out of 116,672 records, 9 studies met the inclusion criteria and reported 12 ML models. Random Forest, XGBoost, and AdaBoost were the predominant models (91.7%). Internal validation was performed for all models, but only two models included external validation. Performance varied, with a mean area under the receiver operating characteristic curve (AUC-ROC) of 0.668 (standard deviation (SD) = 0.066), indicating moderate discrimination. About 75% of models showed a high risk of bias due to inadequate handling of missing data, lack of calibration, and the absence of decision curve analysis (DCA). Conclusions: ML models show promise for predicting HIV treatment interruption, particularly in resource-limited settings. Future research should prioritize external validation, robust missing-data handling, and decision curve analysis, and include sociocultural predictors to improve model robustness. Systematic review registration: PROSPERO CRD42024578109. Supplementary Information: The online version contains supplementary material available at 10.1186/s44263-025-00184-4.