The F score ranks diagnostic tests and prediction models inconsistently with their clinical utility
Background: The F score is derived from precision (positive predictive value) and recall (sensitivity). It is increasingly used to evaluate diagnostic tests and prediction models in the machine learning literature. Although precision and recall can be differentially weighted using a parameter β, almost all applications use equal weighting, with β set to 1.
Methods: We considered a cancer detection scenario to explore the properties of the F score in comparison to net benefit, a well-established method for evaluating the clinical utility of tests and models. Because missing a cancer can be fatal and biopsy is an invasive procedure, we would favor a test with high sensitivity.
Results: F scores did not provide a rank ordering of tests consistent with utility. F1 was highest for a test with greater specificity; in contrast, the conventional decision-analytic measure, net benefit, rank ordered tests consistent with clinical intuition, favoring the test with the highest sensitivity. While it might be argued that F scores can be made consistent with utility by choosing a value of β different from 1, we found it impossible to rationally prespecify β for any given clinical scenario: even small changes in prevalence made the rank ordering for a given β inconsistent with utility.
Conclusion: The F score ranks diagnostic tests and prediction models inconsistently with their clinical utility. Moreover, the F score does not have an interpretable unit, does not allow comparison with a strategy of assuming all patients are negative, and requires dichotomization of models. In contrast, standard decision-analytic measures, net benefit and decision curve analysis, allow rational and consistent choice of weighting, have an interpretable unit, can evaluate the strategy of assuming all patients are negative, and do not require dichotomization of continuous models. Consistent with TRIPOD+AI, we recommend that net benefit, alongside discrimination and calibration, be used for the evaluation of diagnostic tests and prediction models.
Supplementary Information: The online version contains supplementary material available at 10.1186/s41512-025-00214-7.
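The contrast the abstract describes can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical test characteristics (the sensitivities, specificities, prevalence, and the 2% threshold probability are illustrative assumptions, not figures from the paper): the more specific test wins on F1, while net benefit, computed as TP/n − FP/n × pt/(1 − pt), favors the more sensitive test.

```python
# Hypothetical cancer-detection scenario: two tests, 5% prevalence.
# All numbers are illustrative assumptions, not taken from the paper.

def f1_and_net_benefit(sens, spec, prev, pt):
    """Return (F1, net benefit) for a binary test; rates are per patient."""
    tp = sens * prev                # true positive rate
    fp = (1 - spec) * (1 - prev)   # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sens / (precision + sens)
    nb = tp - fp * pt / (1 - pt)   # net benefit at threshold probability pt
    return f1, nb

prev, pt = 0.05, 0.02  # low threshold: missing a cancer is far worse than a biopsy
f1_a, nb_a = f1_and_net_benefit(sens=0.95, spec=0.70, prev=prev, pt=pt)  # sensitive test
f1_b, nb_b = f1_and_net_benefit(sens=0.80, spec=0.90, prev=prev, pt=pt)  # specific test

print(f1_a < f1_b)   # True: F1 ranks the specific test higher...
print(nb_a > nb_b)   # True: ...but net benefit favors the sensitive test
```

At this threshold the sensitive test has higher clinical utility, yet its lower precision drags its F1 down, which is exactly the rank-ordering inconsistency the paper reports.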
- Research Article
95
- 10.1186/s41512-017-0020-3
- Dec 1, 2017
- Diagnostic and Prognostic Research
Background: A variety of statistics have been proposed as tools to help investigators assess the value of diagnostic tests or prediction models. The Brier score has been recommended on the grounds that it is a proper scoring rule that is affected by both discrimination and calibration. However, the Brier score is prevalence dependent in such a way that the rank ordering of tests or models may inappropriately vary by prevalence.
Methods: We explored four common clinical scenarios: comparison of a highly accurate binary test with a continuous prediction model of moderate predictiveness; comparison of two binary tests where the importance of sensitivity versus specificity is inversely associated with prevalence; comparison of models and tests to default strategies of assuming that all or no patients are positive; and comparison of two models with miscalibration in opposite directions.
Results: In each case, we found that the Brier score gave an inappropriate rank ordering of the tests and models. Conversely, net benefit, a decision-analytic measure, gave results that always favored the preferable test or model.
Conclusions: The Brier score does not evaluate the clinical value of diagnostic tests or prediction models. We advocate, as an alternative, the use of decision-analytic measures such as net benefit.
Trial registration: Not applicable.
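A minimal worked example of the prevalence problem (the numbers are illustrative assumptions, not the paper's scenarios): at 5% prevalence, a no-information model that assigns every patient the prevalence as its predicted risk earns a better (lower) Brier score than a highly accurate binary test, even though that model has zero net benefit at any threshold above the prevalence.

```python
# Illustrative numbers only: prevalence 5%; a binary test with 95%
# sensitivity and 95% specificity vs. a useless model predicting 0.05
# for everyone. Expected (population) values, so no simulation needed.

prev = 0.05

# Brier score of the useless model: E[(0.05 - y)^2]
brier_model = prev * (1 - prev) ** 2 + (1 - prev) * prev ** 2   # = 0.0475

# Brier score of the binary test (predictions are 0 or 1, so the
# score reduces to the misclassification rate): FN rate + FP rate
sens = spec = 0.95
brier_test = (1 - sens) * prev + (1 - spec) * (1 - prev)        # = 0.0500

# Net benefit at a 10% threshold probability
pt = 0.10
nb_test = sens * prev - (1 - spec) * (1 - prev) * pt / (1 - pt)
nb_model = 0.0  # the model says 5% risk for all, so nobody crosses the 10% threshold

print(brier_model < brier_test)  # True: Brier prefers the useless model
print(nb_test > nb_model)        # True: net benefit prefers the accurate test
```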
- Discussion
2
- 10.1016/j.brs.2020.04.016
- Apr 28, 2020
- Brain stimulation
The clinical utility of imaging-defined biotypes of depression and transcranial magnetic stimulation: A decision curve analysis
- Research Article
64
- 10.1016/j.ebiom.2018.05.010
- Jun 1, 2018
- EBioMedicine
RankProd Combined with Genetic Algorithm Optimized Artificial Neural Network Establishes a Diagnostic and Prognostic Prediction Model that Revealed C1QTNF3 as a Biomarker for Prostate Cancer
- Research Article
56
- 10.1186/s12911-016-0336-x
- Jul 18, 2016
- BMC Medical Informatics and Decision Making
Background: Risk prediction models have been proposed for various diseases and are being improved as new predictors are identified. A major challenge is to determine whether the newly discovered predictors improve risk prediction. Decision curve analysis has been proposed as an alternative to the area under the curve and the net reclassification index to evaluate the performance of prediction models in clinical scenarios. The decision curve, computed using the net benefit, can evaluate the predictive performance of risk models at a given threshold probability or over a range of threshold probabilities. However, when the decision curves for 2 competing models cross in the range of interest, it is difficult to identify the best model, as there is no readily available summary measure for evaluating the predictive performance. The key deterrent to using simple measures such as the area under the net benefit curve is the assumption that the threshold probabilities are uniformly distributed among patients.
Methods: We propose a novel measure for performing decision curve analysis. The approach estimates the distribution of threshold probabilities without the need for additional data. Using the estimated distribution of threshold probabilities, the weighted area under the net benefit curve serves as the summary measure to compare risk prediction models in a range of interest.
Results: We compared 3 different approaches: the standard method, the area under the net benefit curve, and the weighted area under the net benefit curve. Type 1 error and power comparisons demonstrate that the weighted area under the net benefit curve has higher power than the other methods. Several simulation studies are presented to demonstrate the improvement in model comparison using the weighted area under the net benefit curve compared to the standard method.
Conclusions: The proposed measure improves decision curve analysis by using the weighted area under the net benefit curve, thereby increasing the power of decision curve analysis to compare risk prediction models in a clinical scenario.
Electronic supplementary material: The online version of this article (doi:10.1186/s12911-016-0336-x) contains supplementary material, which is available to authorized users.
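The core computation can be sketched as a weighted average of the net benefit curve over a grid of thresholds. Everything below is a toy illustration under assumed inputs (the net benefit values and the threshold-probability weights are made up); the paper's contribution is estimating the weight distribution from the data itself, which this sketch does not reproduce. It does show why the weighting matters: two crossing curves can swap rank depending on where the threshold mass sits.

```python
# Toy sketch: summarizing crossing net benefit curves as a weighted area.
# The nb values and threshold weights are assumed for illustration; the
# paper estimates the threshold distribution from data instead.

thresholds = [0.05, 0.10, 0.15, 0.20, 0.25]
nb_model_a = [0.050, 0.040, 0.028, 0.018, 0.008]  # better at low thresholds
nb_model_b = [0.030, 0.032, 0.030, 0.026, 0.020]  # better at high thresholds

def weighted_area(nb, weights):
    """Weighted average of net benefit over the threshold grid."""
    total = sum(weights)
    return sum(n * w for n, w in zip(nb, weights)) / total

uniform = [1.0] * len(thresholds)        # the plain (unweighted) summary
skewed = [0.05, 0.15, 0.30, 0.30, 0.20]  # assumed: most clinicians use higher thresholds

print(weighted_area(nb_model_a, uniform) > weighted_area(nb_model_b, uniform))  # True
print(weighted_area(nb_model_a, skewed) < weighted_area(nb_model_b, skewed))    # True
```

Under the uniform weighting model A looks better; concentrating the weight where clinicians actually set their thresholds reverses the ranking, which is the failure mode of the unweighted area that motivates the paper.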
- Research Article
3
- 10.1111/acem.15103
- Feb 6, 2025
- Academic Emergency Medicine
Objective: Calibration and discrimination indicators alone are insufficient for evaluating the clinical usefulness of prediction models, as they do not account for the cost of misclassification errors. This study aimed to modify the Geriatric Trauma Outcome Score (GTOS) and assess the clinical utility of the modified model using net benefit (NB) and decision curve analysis (DCA) for predicting in-hospital mortality.
Methods: The Trauma Quality Improvement Program (TQIP) 2017 was used to identify geriatric trauma patients (≥ 65 years) treated at Level I trauma centers. The outcome of interest was in-hospital mortality. The GTOS was modified to include additional patient, injury, and treatment characteristics identified through machine learning methods, focusing on early risk stratification. Calibration and discrimination indicators, along with NB and DCA, were utilized for evaluation.
Results: Of the 67,222 admitted geriatric trauma patients, 5.6% died in the hospital. The modified GTOS score included the following variables with associated weights: initial airway intervention (5), Glasgow Coma Scale ≤ 13 (5), packed red blood cell transfusion within 24 h (3), penetrating injury (2), age ≥ 75 years (2), preexisting comorbidity (1), and torso injury (1), with a total range from 0 to 19. The modified GTOS demonstrated a significantly higher area under the curve (0.92 vs. 0.84, p < 0.0001), lower misclassification error (4.9% vs. 5.2%), and lower Brier score (0.036 vs. 0.042) compared to the original GTOS. DCA showed that using the modified GTOS for predicting in-hospital mortality resulted in higher NB than treating all, treating none, and treating based on the original GTOS across a wide range of clinician preferences.
Conclusions: The modified GTOS model exhibited superior predictive ability and clinical utility compared to the original GTOS. NB and DCA offer valuable complementary methods to calibration and discrimination indicators, comprehensively evaluating the clinical usefulness of prediction models and decision strategies.
- Research Article
1123
- 10.1186/1472-6947-8-53
- Nov 26, 2008
- BMC Medical Informatics and Decision Making
Background: Decision curve analysis is a novel method for evaluating diagnostic tests, prediction models and molecular markers. It combines the mathematical simplicity of accuracy measures, such as sensitivity and specificity, with the clinical applicability of decision analytic approaches. Most critically, decision curve analysis can be applied directly to a data set, and does not require the sort of external data on costs, benefits and preferences typically required by traditional decision analytic techniques.
Methods: In this paper we present several extensions to decision curve analysis, including correction for overfit, confidence intervals, application to censored data (including competing risk) and calculation of decision curves directly from predicted probabilities. All of these extensions are based on straightforward methods that have previously been described in the literature for application to analogous statistical techniques.
Results: Simulation studies showed that repeated 10-fold cross-validation provided the best method for correcting a decision curve for overfit. The method for applying decision curves to censored data had little bias and coverage was excellent; for competing risk, decision curves were appropriately affected by the incidence of the competing risk and the association between the competing risk and the predictor of interest. Calculation of decision curves directly from predicted probabilities led to a smoothing of the decision curve.
Conclusion: Decision curve analysis can be easily extended to many of the applications common to performance measures for prediction models. Software to implement decision curve analysis is provided.
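The basic calculation underlying a decision curve (before the paper's extensions for overfit correction, confidence intervals, and censored data) can be sketched directly from predicted probabilities and observed outcomes. The outcomes and predictions below are made-up illustrative values:

```python
# Minimal decision-curve sketch: net benefit of a model vs. the
# treat-all and treat-none defaults across a grid of thresholds.
# The outcomes and predicted probabilities are made-up examples.

y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]                        # observed outcomes
p = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]  # model predictions

def net_benefit(y, p, pt):
    """Net benefit of treating patients whose predicted risk is >= pt."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 0)
    return tp / n - fp / n * pt / (1 - pt)

prevalence = sum(y) / len(y)
for pt in (0.1, 0.3, 0.5):
    nb_model = net_benefit(y, p, pt)
    nb_all = prevalence - (1 - prevalence) * pt / (1 - pt)  # treat everyone
    nb_none = 0.0                                           # treat no one
    print(pt, round(nb_model, 3), round(nb_all, 3), nb_none)
```

Plotting `nb_model`, `nb_all`, and `nb_none` against the threshold grid gives the decision curve; a useful model sits above both default strategies over the clinically relevant threshold range.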
- Research Article
2
- 10.1002/jso.27508
- Nov 16, 2023
- Journal of Surgical Oncology
The mutation status of rat sarcoma viral oncogene homolog (RAS) has prognostic significance and serves as a key predictive biomarker for the effectiveness of anti-epidermal growth factor receptor (EGFR) therapy. However, there remains a lack of effective models for predicting RAS mutation status in colorectal liver metastases (CRLMs). This study aimed to construct and validate a diagnostic model for predicting RAS mutation status among patients undergoing hepatic resection for CRLMs. A diagnostic multivariate prediction model was developed and validated in patients with CRLMs who had undergone hepatectomy between 2014 and 2020. Patients from Institution A were assigned to the model development group (i.e., Development Cohort), while patients from Institutions B and C were assigned to the external validation groups (i.e., Validation Cohort_1 and Validation Cohort_2). The presence of CRLMs was determined by examination of surgical specimens. RAS mutation status was determined by genetic testing. The final predictors, identified by a group of oncologists and radiologists, included several key clinical, demographic, and radiographic characteristics derived from magnetic resonance images. Multiple imputation was performed to estimate the values of missing non-outcome data. A penalized logistic regression model using the adaptive least absolute shrinkage and selection operator (LASSO) penalty was implemented to select appropriate variables for the development of the model. A single nomogram was constructed from the model. The performance of the prediction model, in terms of discrimination and calibration, was estimated and reported using the area under the receiver operating characteristic curve (AUC) and calibration plots. Internal validation with a bootstrapping procedure and external validation of the nomogram were assessed. Finally, decision curve analyses were used to characterize the clinical outcomes of the Development and Validation Cohorts.
A total of 173 patients were enrolled in this study between January 2014 and May 2020. Of the 173 patients, 117 patients from Institution A were assigned to the Model Development group, while 56 patients (33 from Institution B and 23 from Institution C) were assigned to the Model Validation groups. Forty-six (39.3%) patients harbored RAS mutations in the Development Cohort compared to 14 (42.4%) in Validation Cohort_1 and 8 (34.8%) in Validation Cohort_2. The final model contained the following predictor variables: time of occurrence of CRLMs, location of primary lesion, type of intratumoral necrosis, and early enhancement of liver parenchyma. The diagnostic model based on clinical and MRI data demonstrated satisfactory predictive performance in distinguishing between mutated and wild-type RAS, with AUCs of 0.742 (95% confidence interval [CI]: 0.651–0.834), 0.741 (95% CI: 0.649–0.836), 0.703 (95% CI: 0.514–0.892), and 0.708 (95% CI: 0.452–0.964) in the Development Cohort, bootstrapping internal validation, external Validation Cohort_1, and Validation Cohort_2, respectively. The Hosmer-Lemeshow goodness-of-fit values for the Development Cohort, Validation Cohort_1, and Validation Cohort_2 were 2.868 (p = 0.942), 4.616 (p = 0.465), and 6.297 (p = 0.391), respectively. Integrating clinical, demographic, and radiographic modalities with a magnetic resonance imaging-based approach may accurately predict the RAS mutation status of CRLMs, thereby aiding in triage and possibly reducing the time taken to perform diagnostic and life-saving procedures. Our diagnostic multivariate prediction model may serve as a foundation for prognostic stratification and therapeutic decision-making.
- Research Article
- 10.3389/fdgth.2025.1575320
- May 16, 2025
- Frontiers in Digital Health
Introduction: The application of artificial intelligence in diagnostic prediction models for diseases and syndromes in Chinese Medicine (CM) has been rapidly expanding, accompanied by a significant increase in related research publications. However, existing reporting guidelines for diagnostic prediction models are primarily tailored to Western medicine, which differs fundamentally from CM in its theoretical framework, terminology, and classification systems. To address this gap, it is essential to establish a transparent and standardized reporting tool specifically designed for CM diagnostic and syndrome prediction models. This will enhance the transparency, reproducibility, and clinical relevance of research findings in this emerging field.
Methods: This study adopts a structured, multi-phase Delphi protocol. A core working group will first conduct a comprehensive review of published studies on CM diagnostic prediction models to develop an initial item pool for the Transparent Reporting Tool for AI-based Diagnostic Prediction Models of Disease and Syndrome in Chinese Medicine (TRAPODS-CM). Delphi questionnaires will then be distributed via email to a multidisciplinary panel of experts in CM, computer science, and evidence-based methodology who meet the inclusion criteria. The number of Delphi rounds will be determined by evaluating the active coefficient, expert authority, and expert consensus. Final consensus on the TRAPODS-CM checklist will be achieved through online meetings. The study will be governed by a Steering Committee, with the core working group responsible for implementation. After publication, the finalized checklist will be disseminated via multimedia platforms, seminars, and academic conferences to maximize its academic and clinical impact.
Ethics and Dissemination: This project is supported by the National Natural Science Foundation of China (Grant No. 82374336) and has received ethical approval from the Institutional Review Board of Nanyang Technological University (IRB-2024-1007). The study findings will be disseminated through peer-reviewed publications.
- Research Article
12
- 10.1097/cm9.0000000000001989
- Mar 10, 2022
- Chinese Medical Journal
The prevalence of hypertension is high among Chinese adults, thus, identifying non-hypertensive individuals at high risk for intervention will help to improve the efficiency of primary prevention strategies. The cross-sectional data on 9699 participants aged 20 to 80 years were collected from the China National Health Survey in Gansu and Hebei provinces in 2016 to 2017, and they were nonrandomly split into the training set and validation set based on location. Multivariable logistic regression analysis was performed to develop the diagnostic prediction model, which was presented as a nomogram and a website with risk classification. Predictive performances of the model were evaluated using discrimination and calibration, and were further compared with a previously published model. Decision curve analysis was used to calculate the standardized net benefit for assessing the clinical usefulness of the model. The Lasso regression analysis identified the significant predictors of hypertension in the training set, and a diagnostic model was developed using logistic regression. A nomogram with risk classification was constructed to visualize the model, and a website ( https://chris-yu.shinyapps.io/hypertension_risk_prediction/ ) was developed to calculate the exact probabilities of hypertension. The model showed good discrimination and calibration, with the C-index of 0.789 (95% confidence interval [CI]: 0.768, 0.810) through internal validation and 0.829 (95% CI: 0.816, 0.842) through external validation. Decision curve analysis demonstrated that the model was clinically useful. The model had a higher area under receiver operating characteristic curves in training and validation sets compared with a previously published diagnostic model based on Northern China population. This study developed and validated a diagnostic model for hypertension prediction in Gansu Province. 
A nomogram and a website were developed to make the model convenient to use, facilitating individualized prediction of hypertension in the general Han and Yugur population.
- Research Article
1
- 10.1007/s42000-025-00634-6
- Feb 13, 2025
- Hormones (Athens, Greece)
Nonalcoholic fatty liver disease (NAFLD) is a multisystem disease that can trigger the metabolic syndrome. Early prevention and treatment of NAFLD is still a huge challenge for patients and clinicians. The aim of this study was to develop and validate machine learning (ML)-based predictive models. The model with optimal performance would be developed as a set of simple arithmetic tools for predicting the risk of NAFLD individually. Statistical analyses were performed in 2428 individuals extracted from the National Health and Nutrition Examination Survey (NHANES, cycle 2017-2020.3) database. Feature variables were selected by the least absolute shrinkage and selection operator (LASSO) regression. Seven ML algorithms, including logistic regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGB), K-nearest neighbor (KNN), light gradient boosting machine (LightGBM), and multilayer perceptron (MLP), were used to construct models based on the feature variables and evaluate their performance. The model with the best performance was transformed into a diagnostic predictive nomogram (DPN). The DPN was developed into an online calculator and an Excel algorithm tool. Receiver operating characteristic (ROC) curve, decision curve analysis (DCA), and subgroup analyses were used to compare and assess the predictive abilities of the DPN and six existing NAFLD predictive models, including the ZJU index, the hepatic steatosis index (HSI), the triglyceride-glucose index (TyG), the Framingham steatosis index (FSI), the fatty liver index (FLI), and the visceral adiposity index (VAI). Among the 2428 participants, the prevalence of NAFLD was 47.45%. LASSO regression identified eight variables from 39 variables, including body mass index (BMI), waist circumference (WC), alanine aminotransferase (ALT), triglyceride (TG), diabetes, hypertension, uric acid (UA), and race. 
Among the models constructed by the seven algorithms mentioned above, the LR-based model performed the best, demonstrating outstanding performance in terms of area under the curve (AUC, 0.823), accuracy (0.754), precision (0.768), specificity (0.804), and positive predictive value (0.768). It was then transformed into the DPN, which was successfully developed as an online calculator and an Excel algorithm tool. The diagnostic accuracy (AUC 0.856, 95% confidence interval (CI) 0.839-0.874, and AUC 0.823, 95% CI 0.793-0.854, respectively) and net clinical benefit of DPN in the training and validation sets were superior to those of the ZJU, HSI, TyG, FSI, FLI, and VAI. The results were maintained in subgroup analyses. The LR model based on ML was developed, exhibiting good performance. DPN can be used as an individualized tool for rapid detection of NAFLD.
- Research Article
105
- 10.1371/journal.pmed.1001886
- Oct 13, 2015
- PLOS Medicine
A fundamental part of medical research is the development and validation of diagnostic and prognostic prediction models [1,2]. These prediction models aim to predict the absolute probability that a certain disease or condition is currently present (diagnostic models) or that an outcome will occur within a specific follow-up period (prognostic models) for an individual subject. Prediction models typically rely on multiple predictors, which can include demographic characteristics, medical history and physical examination items, or more complex measurements from, for example, medical imaging, electrophysiology, pathology, and biomarkers. Even for diagnostic models, probability estimates are rarely based on a single test; doctors naturally integrate several patient characteristics and symptoms [3]. A broad range of prediction modeling techniques exists, including regression approaches, neural network models, decision tree models, genetic programming models, and support vector machine learning models, although prediction models developed by a multivariable regression approach are by far the most common. It is widely recommended that a developed prediction model should not be used in practice before being externally validated, at least once, in individuals other than those used for model development [4–7]. Unfortunately, most prediction models are poorly validated or not validated at all, rendering interpretation of their generalizability difficult. In addition, many systematic reviews have shown that for the same outcome or same target population, numerous competing models exist [8–10]. Generally speaking, researchers often ignore existing prediction models and develop yet another prediction model from their own data [2]. This practice sustains a cycle of underpowered prediction model development studies and poor knowledge about the generalizability and applicability of developed prediction models. 
Evidence synthesis and meta-analysis of individual participant data (IPD) from multiple studies offer a unique opportunity to address these problems, as they allow researchers to develop and directly validate models on large datasets and across a wide range of populations and settings, and so to directly test a model's generalizability (Fig 1) [11–13]. [Fig 1: Trends in publications of IPD-MA studies focusing on the development and/or validation of diagnostic or prognostic prediction models.] There is currently little guidance on how to conduct an IPD meta-analysis (IPD-MA) for developing and/or validating diagnostic or prognostic prediction models [15]. To date, most IPD-MA articles focus on estimating relative quantities, like a risk ratio, hazard ratio, or odds ratio for a specific treatment or a specific etiologic factor. In contrast, prediction modeling research is focused on developing and validating multivariable models aimed at calculating an absolute risk estimate from the combined variables, rather than estimating the relative effect of a specific treatment or etiologic factor. Furthermore, prediction modeling studies focus entirely on the role and joint contribution of multiple covariates, whereas intervention studies in principle rely on randomization to reduce the role of covariates (Table 1). [Table 1: The main differences between IPD-MA of treatment intervention studies and of multivariable prediction modeling studies.] Hence, IPD-MAs of randomized intervention and etiological studies, which are beyond the scope of this paper and are instead addressed in the accompanying paper [16], differ from IPD-MAs of multivariable prediction models, which are the focus of this paper. We provide an overview of the advantages and limitations of IPD-MAs aiming to develop a novel prediction model or to validate one or more existing models across multiple datasets. 
This overview is based on published guidelines and existing recommendations for the conduct of prediction modeling studies and of IPD-MA research. We illustrate this overview with examples of recently published IPD-MAs of prediction models across various medical domains. Our aim is to help researchers, readers, reviewers, and editors to identify and understand the key issues involved with such IPD-MA projects.
- Research Article
7
- 10.1111/eci.12723
- Jan 28, 2017
- European Journal of Clinical Investigation
Decision curve analysis (DCA) is an increasingly used method for evaluating diagnostic tests and predictive models, but its application requires individual patient data. The Monte Carlo (MC) method can be used to simulate probabilities and outcomes of individual patients and offers an attractive option for application of DCA. We constructed a MC decision model to simulate individual probabilities of outcomes of interest. These probabilities were contrasted against the threshold probability at which a decision-maker is indifferent between key management strategies: treat all, treat none or use predictive model to guide treatment. We compared the results of DCA with MC simulated data against the results of DCA based on actual individual patient data for three decision models published in the literature: (i) statins for primary prevention of cardiovascular disease, (ii) hospice referral for terminally ill patients and (iii) prostate cancer surgery. The results of MC DCA and patient data DCA were identical. To the extent that patient data DCA were used to inform decisions about statin use, referral to hospice or prostate surgery, the results indicate that MC DCA could have also been used. As long as the aggregate parameters on distribution of the probability of outcomes and treatment effects are accurately described in the published reports, the MC DCA will generate indistinguishable results from individual patient data DCA. We provide a simple, easy-to-use model, which can facilitate wider use of DCA and better evaluation of diagnostic tests and predictive models that rely only on aggregate data reported in the literature.
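The idea can be sketched in a few lines: draw individual risk probabilities from an assumed aggregate distribution, simulate outcomes from those risks, and run the usual net benefit calculation on the simulated patients. The beta distribution, its parameters, and the threshold below are assumptions for illustration, not the published models the paper re-analyzed.

```python
# Monte Carlo sketch of decision curve analysis from aggregate data:
# individual risks are drawn from an assumed Beta(2, 8) distribution
# (mean risk 0.2); outcomes are then simulated from those risks.
import random

random.seed(1)
n = 200_000
risks = [random.betavariate(2, 8) for _ in range(n)]
outcomes = [1 if random.random() < r else 0 for r in risks]

def net_benefit(outcomes, risks, pt):
    """Net benefit of treating simulated patients with risk >= pt."""
    tp = sum(y for y, r in zip(outcomes, risks) if r >= pt)
    fp = sum(1 - y for y, r in zip(outcomes, risks) if r >= pt)
    return tp / len(outcomes) - fp / len(outcomes) * pt / (1 - pt)

pt = 0.3  # threshold probability for treatment
prev = sum(outcomes) / n
nb_model = net_benefit(outcomes, risks, pt)
nb_all = prev - (1 - prev) * pt / (1 - pt)   # treat-all strategy

print(nb_model > max(nb_all, 0.0))  # True: model-guided treatment wins here
```

Repeating the calculation over a grid of thresholds traces out the simulated decision curve; as the abstract notes, when the aggregate distribution of risks is accurately specified, this reproduces the curve that individual patient data would give.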
- Research Article
1
- 10.1080/07853890.2024.2433677
- Nov 29, 2024
- Annals of Medicine
Background Bronchopulmonary dysplasia (BPD) is the most common chronic respiratory disease among preterm infants. Owing to the limitations in current diagnostic methods, developing a predictive model for BPD is crucial. Methods Using 243 autophagy-associated genes and dataset GSE32472, differential expression of autophagy-associated genes was identified at postnatal days 5, 14, and 28 between BPD patients and controls. LASSO and multivariate logistic regression analyses were performed to screen for diagnostic prediction genes. Receiver Operating Characteristic, Harrell’s concordance index, and decision curve analysis (DCA) were used to evaluate the diagnostic prediction model in GSE32472 and GSE220135. A BPD mouse model was constructed and qRT-PCR and Western blot were used to verify gene expression in lung tissue. Results Based on p < 0.05, we constructed a diagnostic prediction model for BPD using WIPI1, TOMM70A, BAG3, and PRKCQ. For the training database, the model’s C-index and Area under Curve were both 0.941, and a high applicability value was demonstrated by the DCA curve. These outcomes were also confirmed in the validation cohort GSE220135, demonstrating the superior diagnostic prediction capability of our approach. In addition, significant variations in immune cell infiltration were observed between BPD patients and controls. According to the results of qRT-PCR, BPD model mice had significantly lower expression levels of WIPI1, TOMM70A, BAG3, and PRKCQ than controls. Conclusions We constructed and validated a diagnostic prediction model for BPD based on WIPI1, TOMM70A, BAG3, and PRKCQ. These four genes may influence BPD development by regulating immune responses and immune cells.
- Research Article
2
- 10.1093/ckj/sfae038
- Feb 15, 2024
- Clinical Kidney Journal
Background: Vascular calcification (VC) commonly occurs in and seriously increases the risk of cardiovascular events and mortality among patients on hemodialysis. To optimize individual management, we aimed to develop a diagnostic multivariable prediction model for evaluating the probability of VC.
Methods: The study was conducted in four steps. First, identification of miRNAs regulating osteogenic differentiation of vascular smooth muscle cells (VSMCs) under calcified conditions. Second, observing the role of miR-129-3p in VC in vitro and the association between circulating miR-129-3p and VC in hemodialysis patients. Third, collecting all indicators related to VC as candidate variables, screening predictors from the candidate variables by Lasso regression, developing the prediction model by logistic regression, and presenting it as a nomogram in the training cohort. Last, verifying the predictive performance of the model in the validation cohort.
Results: In cell experiments, miR-129-3p was found to attenuate vascular calcification, and in humans, serum miR-129-3p exhibited a negative correlation with vascular calcification, suggesting that miR-129-3p could be one of the candidate predictor variables. Regression analysis demonstrated that miR-129-3p, age, dialysis duration and smoking were valid factors with which to establish the prediction model and nomogram for VC. The area under the receiver operating characteristic curve of the model was 0.8698. The calibration curve showed that the predicted probability of the model was in good agreement with the actual probability, and decision curve analysis indicated a favorable net benefit of the model. Furthermore, internal validation through a bootstrap process and external validation in another independent cohort confirmed the stability of the model.
Conclusion: We built a diagnostic prediction model, presented as an intuitive tool based on miR-129-3p and clinical indicators, to evaluate the probability of VC in hemodialysis patients, facilitating risk stratification and effective decision-making, which may be of great importance for reducing the risk of serious cardiovascular events.
- Research Article
40
- 10.1148/radiol.2021204093
- Aug 31, 2021
- Radiology
Background Gallium 68 (68Ga) prostate-specific membrane antigen (PSMA) PET/MRI may improve detection of clinically significant prostate cancer (CSPC). Purpose To compare the sensitivity and specificity of 68Ga-PSMA PET/MRI with multiparametric MRI for detecting CSPC. Materials and Methods Men with prostate specific antigen levels of 2.5-20 ng/mL prospectively underwent 68Ga-PSMA PET/MRI, including multiparametric MRI sequences, between June 2019 and March 2020. Imaging was evaluated independently by two radiologists by using the Prostate Imaging Reporting and Data System (PI-RADS) version 2.1. Sensitivity and specificity for CSPC (International Society of Urological Pathology grade group ≥ 2) were compared for 68Ga-PSMA PET/MRI and multiparametric MRI by using the McNemar test. Decision curve analysis compared the net benefit of each imaging strategy. Results Ninety-nine men (median age, 67 years; interquartile range, 62-71 years) were included; 79% (78 of 99) underwent biopsy. CSPC was detected in 32% (25 of 78). For CSPC, specificity was higher for 68Ga-PSMA PET/MRI than multiparametric MRI (76% [95% CI: 62, 86] vs 49% [95% CI: 35, 63], respectively; P < .001). Sensitivity was similar (88% [95% CI: 69, 98] vs 92% [95% CI: 74, 99], respectively; P > .99). For PI-RADS 3 lesions, specificity was also higher for 68Ga-PSMA PET/MRI than for multiparametric MRI: 86% (95% CI: 73, 95) versus 59% (95% CI: 43, 74), respectively (P = .002). Decision curve analysis showed that biopsies targeted to PSMA uptake increased the net benefit of multiparametric MRI only among PI-RADS 3 lesions. The net benefit of targeted biopsy for a PI-RADS 3 lesion with PSMA uptake was higher across all threshold probabilities over 8%. The net benefit of targeted biopsy was similar for PI-RADS 4 and 5 lesions, regardless of PSMA uptake. 
Conclusions Gallium 68 prostate-specific membrane antigen PET/MRI improved specificity for clinically significant prostate cancer compared with multiparametric MRI, particularly in Prostate Imaging Reporting and Data System category 3 lesions. © RSNA, 2021. Online supplemental material is available for this article. See also the editorial by Williams and Estes in this issue.