Artificial Intelligence computed tomography models for the discrimination of Wilms versus non-Wilms tumors: systematic review and meta-analysis

Abstract

Objective: To conduct a systematic review and meta-analysis evaluating the effectiveness of artificial intelligence (AI) models for identifying Wilms tumor on computed tomography (CT) scans.
Methods: A search was carried out across MEDLINE, Embase, Web of Science, and Cochrane databases in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. Diagnostic studies using AI-based CT to diagnose Wilms tumor were included if they reported sensitivity, specificity, and area under the curve (AUC). Studies with incomplete data or lacking full-text availability were excluded. Statistical analysis was conducted in R (v4.3.3) using a random-effects model, with logit transformation for univariate analysis and summary receiver operating characteristic (SROC) curve construction for bivariate analysis. Heterogeneity (I² ≥ 40%) was assessed and explored via sensitivity analysis.
Results: The analysis included four studies (three from China and one from Turkey) with 177 patients with Wilms tumors and 62 without. The combined analysis of all models demonstrated a sensitivity of 63.9% (95% CI: 53.3%–73.4%), a specificity of 82.8% (95% CI: 71.6%–90.2%), and an AUC of 0.831 (95% CI: 0.607–0.883).
Conclusion: AI models exhibit moderate sensitivity and high specificity for identifying Wilms tumor on CT scans, with an overall AUC of 0.831. These results underscore the promise of AI as a supportive tool in diagnostic imaging, although the limited number of studies and notable methodological heterogeneity warrant cautious interpretation and reinforce the need for validation in larger, more representative populations.
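The pooling approach described in the Methods (logit transformation of per-study proportions under a random-effects model) can be sketched in Python. This is an illustration only: the study counts below are hypothetical stand-ins, not data from the included studies, and the DerSimonian-Laird estimator used here is one common choice of random-effects method.

```python
import math

def pool_logit(study_counts):
    """Pool proportions (e.g. per-study sensitivities) on the logit scale
    with a DerSimonian-Laird random-effects model, then back-transform.
    Input: list of (events, total) pairs, one per study."""
    logits, variances = [], []
    for events, total in study_counts:
        # A 0.5 continuity correction guards against 0% / 100% cells.
        e, t = events + 0.5, total + 1.0
        p = e / t
        logits.append(math.log(p / (1 - p)))
        variances.append(1 / e + 1 / (t - e))
    w = [1 / v for v in variances]                       # fixed-effect weights
    fe = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
    q = sum(wi * (yi - fe) ** 2 for wi, yi in zip(w, logits))  # Cochran's Q
    df = len(logits) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                        # between-study variance
    w_re = [1 / (v + tau2) for v in variances]           # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I^2 heterogeneity (%)
    return 1 / (1 + math.exp(-pooled)), i2               # back-transformed proportion

# Hypothetical (true positives, diseased patients) counts from four studies.
sens, i2 = pool_logit([(30, 45), (40, 60), (25, 42), (18, 30)])
```

When I² meets the review's ≥ 40% threshold, a sensitivity analysis (re-pooling with each study omitted in turn) helps locate the source of heterogeneity.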

Similar Papers
  • Research Article
  • 10.1590/2175-8239-jbn-2025-0010en
Artificial Intelligence computed tomography models for the discrimination of Wilms versus non-Wilms tumors: systematic review and meta-analysis.
  • Mar 1, 2026
  • Jornal brasileiro de nefrologia
  • Helvécio Neves Feitosa Filho + 8 more


  • Research Article
  • 10.1097/corr.0000000000003660
Are Artificial Intelligence Models Reliable for Clinical Application in Pediatric Fracture Detection on Radiographs? A Systematic Review and Meta-analysis.
  • Aug 20, 2025
  • Clinical orthopaedics and related research
  • Gabriel Fontenele Ximenes + 6 more

Artificial intelligence (AI) applications for pediatric fracture diagnosis using radiographs have demonstrated growing potential in clinical settings. Despite this growing potential, existing studies are limited by small sample sizes, variability in their diagnostic metrics, and inconsistent use of external validation, which reduces confidence in their findings. These limitations hinder the assessment of real-world performance. A meta-analysis would help address these gaps by pooling data to generate more robust, generalizable estimates for clinical application and future guidance. (1) What is the pooled diagnostic performance of AI models, including sensitivity, specificity, and area under the curve (AUC), for detecting pediatric fractures on radiographs? (2) What is the clinical applicability of AI models, as determined by whether their diagnostic performance is sustained in studies that employed external validation? (3) How does anatomic coverage influence the diagnostic performance of AI models? This meta-analysis adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines and was registered in PROSPERO (CRD42024628342). A systematic search of PubMed/MEDLINE, Embase, and the Cochrane Library was conducted from database inception through December 9, 2024. A total of 497 records were identified. Eligible studies included pediatric patients with suspected fractures evaluated by AI models on radiographs. Studies were excluded if they lacked sufficient data to calculate sensitivity, specificity, or AUC; if they combined adult and pediatric populations; or if they focused on rib fractures. Sixteen diagnostic accuracy studies were included, involving 10,203 pediatric patients with a mean age of 8.85 years, 54% of whom were male, and 21,789 radiographs, of which 5882 confirmed fractures. Data extraction followed the Population, Index test, Target condition (PIT) framework and was performed independently by two reviewers. 
The risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, which evaluates four domains (patient selection, index test, reference standard, and flow/timing) for low, high, or unclear risk. Most studies exhibited low to moderate risk of bias. Certainty of evidence was evaluated using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach, which classifies evidence as high, moderate, low, or very low, and in this study demonstrated high certainty of evidence. Heterogeneity in the pooled estimates was moderate for sensitivity (I² = 61%) and high for specificity (I² = 90%). No evidence of publication bias was detected based on the Egger test (p = 0.54) and funnel plot symmetry. Meta-analyses used logit transformation and bivariate modeling to estimate pooled sensitivity, specificity, and AUC. The pooled analysis demonstrated a sensitivity of 93% (95% confidence interval [CI] 92% to 94%), a specificity of 91% (95% CI 88% to 93%), and an AUC of 0.96 (95% CI 0.92 to 0.97). The AUC reflects the overall ability of a model to distinguish between patients with and without fractures, with values closer to 1.0 indicating better diagnostic performance. When evaluated on external data sets, AI models maintained high diagnostic accuracy, with a sensitivity of 93% (95% CI 90% to 95%), specificity of 88% (95% CI 84% to 91%), and an AUC of 0.95 (95% CI 0.89 to 0.97), supporting their potential for clinical applicability. Anatomic coverage by specific region made a meaningful contribution to explaining the observed heterogeneity. Models evaluating multiple regions showed slightly higher sensitivity, while those focused on single regions demonstrated better specificity, suggesting that a broader anatomic scope may improve fracture detection but slightly reduce accuracy in ruling out false positives.
This meta-analysis demonstrates that AI models can accurately detect pediatric fractures on radiographs, a finding that withstood scrutiny in studies that included external validation. These findings suggest that orthopaedic surgeons and emergency physicians can consider incorporating validated convolutional neural network algorithms into workflows to enhance diagnostic accuracy, especially in acute care settings where rapid and accurate decision-making is critical. Nevertheless, future research is needed to investigate performance across specific subgroups, including sex and anatomic regions. Paired-design diagnostic accuracy studies with external geographic validation remain the most appropriate method to assess their real-world value. Such validation should be prioritized as a prerequisite for clinical generalization and democratization of AI models, even before randomized trials or prospective implementation studies. Level III, diagnostic study.

  • Preprint Article
  • 10.2196/preprints.78306
Diagnostic Performance of CT-Based Artificial Intelligence for Early Recurrence of Cholangiocarcinoma: A Systematic Review and Meta-Analysis (Preprint)
  • May 30, 2025
  • Jie Chen + 5 more

BACKGROUND Despite AI models showing high predictive accuracy for early CCA recurrence, their clinical use faces challenges like reproducibility, generalizability, hidden biases, and uncertain performance across diverse datasets and populations, raising concerns about practical applicability. OBJECTIVE This meta-analysis seeks to systematically assess the diagnostic performance of artificial intelligence (AI) models utilizing computed tomography (CT) imaging for predicting the early recurrence of cholangiocarcinoma (CCA). METHODS A systematic search of PubMed, Embase, and Web of Science was performed for studies published up to April 2025, focusing on the ability of CT-based AI to predict early recurrence of CCA. Heterogeneity was evaluated using the I² statistic, and data were pooled using a bivariate random-effects model. Methodological quality was assessed with an optimized version of the revised QUADAS-2 tool. RESULTS Nine studies with 30 datasets involving 1,537 patients were included. In internal validation cohorts, CT-based AI models showed a pooled sensitivity of 0.87 (95% CI: 0.81–0.92), specificity of 0.85 (95% CI: 0.79–0.89), diagnostic odds ratio (DOR) of 37.71 (95% CI: 18.35–77.51), and area under the curve (AUC) of 0.93 (95% CI: 0.90–0.94). In external validation cohorts, the pooled sensitivity was 0.87 (95% CI: 0.81–0.91), specificity was 0.82 (95% CI: 0.77–0.86), DOR was 30.81 (95% CI: 18.79–50.52), and AUC was 0.85 (95% CI: 0.82–0.88). The AUC was significantly lower in external validation than in internal validation (P < 0.001). CONCLUSIONS Our results show that CT-based AI models predict early CCA recurrence with high performance in internal validation sets and moderate performance in external validation sets. Future research should focus on prospective designs and establishing standardized gold standards to further validate the clinical applicability and generalization value of AI models.
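The diagnostic odds ratio (DOR) reported above is a simple function of sensitivity and specificity. A minimal Python illustration follows; note that pooled DORs are estimated jointly in the bivariate model, so plugging in the rounded pooled sensitivity and specificity only approximates the reported 37.71.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR: odds of a positive test among diseased patients divided by
    the odds of a positive test among non-diseased patients."""
    return (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity)

# Rounded pooled internal-validation estimates from the abstract above.
dor = diagnostic_odds_ratio(0.87, 0.85)  # ≈ 37.9, close to the reported 37.71
```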

  • Research Article
  • 10.1016/j.lungcan.2025.108577
Artificial intelligence in predicting EGFR mutations from whole slide images in lung cancer: A systematic review and meta-analysis.
  • Jun 1, 2025
  • Lung cancer (Amsterdam, Netherlands)
  • Mai Hanh Nguyen + 3 more


  • Research Article
  • Citations: 32
  • 10.1542/pir.34-7-328
Wilms Tumor
  • Jul 1, 2013
  • Pediatrics in Review
  • A D Friedman


  • Research Article
  • Citations: 2
  • 10.21873/anticanres.17414
Artificial Intelligence Models Could Enhance the Diagnostic Accuracy (DA) of Fecal Immunochemical Test (FIT) in the Detection of Colorectal Adenoma in a Screening Setting.
  • Dec 30, 2024
  • Anticancer research
  • Maaret Eskelinen + 5 more

This study evaluated the diagnostic accuracy (DA) for colorectal adenomas (CRA), screened by fecal immunochemical test (FIT), using five artificial intelligence (AI) models: logistic regression (LR), support vector machine (SVM), neural network (NN), random forest (RF), and gradient boosting machine (GBM). These models were tested together with clinical features categorized as low-risk (lowR) and high-risk (highR). The colorectal neoplasia (CRN) screening cohort of 5,090 patients included 222 CRA patients and 264 non-CRA patients. Three consecutive fecal samples from each individual were analyzed by two fecal occult blood (FOB) assays. Five AI models including clinical features of CRN patients and CV test results were used to test the DA for CRA measured by receiver operating characteristic (ROC) curves. In conventional ROC analysis, the area under the curve (AUC) values for different AI models ranged from 0.659 to 0.691 (for AIs with LR and SVM), while the highest AUC values were reached by the NN, RF, and GBM models (0.809, 0.840, and 0.858, respectively). In the hierarchical summary ROC (HSROC) analysis, the AUC values were as follows: i) with lowR variables, AUC = 0.508; ii) with highR variables, AUC = 0.566; and iii) with all AI models, AUC = 0.789. The differences in AUC values were: between i) and ii), p = 0.008; between i) and iii), p < 0.0001; and between ii) and iii), p < 0.0001. In the detection of CRA, the AI models proved to be superior to the diagnostic features without AI. This is the first study to report that DA in the diagnosis of CRA can be enhanced by AI models that include clinical data of the patients and FIT results.

  • Research Article
  • Citations: 11
  • 10.1371/journal.pone.0288631
Artificial intelligence for detecting temporomandibular joint osteoarthritis using radiographic image data: A systematic review and meta-analysis of diagnostic test accuracy.
  • Jul 14, 2023
  • PLOS ONE
  • Liang Xu + 4 more

In this review, we assessed the diagnostic efficiency of artificial intelligence (AI) models in detecting temporomandibular joint osteoarthritis (TMJOA) using radiographic imaging data. Based upon the PRISMA guidelines, a systematic review of studies published between January 2010 and January 2023 was conducted using PubMed, Web of Science, Scopus, and Embase. Articles on the accuracy of AI in detecting TMJOA or degenerative changes on radiographic imaging were selected. The characteristics and diagnostic information of each article were extracted. The quality of studies was assessed with the QUADAS-2 tool. Pooled data for sensitivity, specificity, and the summary receiver operating characteristic (SROC) curve were calculated. Of 513 records identified through a database search, six met the inclusion criteria and were collected. The pooled sensitivity, specificity, and area under the curve (AUC) were 80%, 90%, and 92%, respectively. Substantial heterogeneity between AI models mainly arose from imaging modality, ethnicity, sex, AI techniques, and sample size. These findings suggest that AI models have enormous potential to diagnose TMJOA automatically from radiographic images; however, further studies are needed to evaluate AI more thoroughly.
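For intuition about the AUC figures quoted above, the area under an ROC curve can be computed from operating points with the trapezoidal rule. This is a simplified sketch, not the method such meta-analyses actually use (a pooled SROC comes from a hierarchical bivariate model); the single hypothetical operating point below merely echoes the 80% sensitivity and 90% specificity reported above.

```python
def auc_trapezoid(points):
    """AUC from (FPR, TPR) operating points via the trapezoidal rule.
    The (0, 0) and (1, 1) ROC corners are added automatically."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# One operating point: sensitivity 0.80 at specificity 0.90 (i.e. FPR 0.10).
auc = auc_trapezoid([(0.10, 0.80)])  # ≈ 0.85
```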

  • Research Article
  • 10.70749/ijbr.v3i5.1294
Artificial Intelligence in Predicting Pregnancy Complications: A Systematic Review and Meta-Analysis of Preeclampsia and Gestational Diabetes Mellitus
  • May 15, 2025
  • Indus Journal of Bioscience Research
  • Salma Malik + 3 more

This systematic review and meta-analysis evaluates the performance of artificial intelligence (AI) models in predicting two major pregnancy complications: preeclampsia and gestational diabetes mellitus (GDM). Adhering to PRISMA guidelines, we analyzed 13 studies from PubMed, Scopus, Web of Science, and IEEE Xplore, selected from an initial pool of 2,163 articles. Using R software (version 4.3.1), we conducted a random-effects meta-analysis, assessing metrics such as the area under the curve (AUC), sensitivity, specificity, and accuracy. The study demonstrated strong predictive performance for preeclampsia and gestational diabetes mellitus (GDM) using artificial intelligence (AI) models. For preeclampsia prediction, the training area under the curve (AUC) was 0.878, while the test AUC was 0.861. Similarly, for GDM, the training AUC was 0.779, and the test AUC was 0.800, indicating high discriminative ability. Tree-based and neural network models outperformed other approaches, particularly when incorporating multimodal data—such as clinical and biochemical data or electronic health records (EHR). Sensitivity analysis further supported these findings, even after excluding high-risk studies identified by the PROBAST tool. While AI models show promise for antenatal risk screening, challenges remain, including limited external validation and interpretability. Future research should focus on improving model transparency, ensuring diverse ethnic representation, and facilitating seamless integration into clinical practice. These steps are critical to harnessing AI's potential for enhancing maternal and fetal health outcomes.

  • Research Article
  • 10.1200/jco.2025.43.16_suppl.e16184
Accuracy of artificial intelligence models integrating machine learning and deep learning in detecting microvascular invasion in liver cancer: A systematic review and meta-analysis.
  • Jun 1, 2025
  • Journal of Clinical Oncology
  • Minh Huu Nhat Le + 12 more

e16184 Background: Hepatocellular carcinoma (HCC) is a global health challenge, ranking sixth in incidence and third in cancer-related mortality. Microvascular invasion (MVI) is a crucial prognostic marker influencing recurrence rates and survival. Accurate preoperative MVI detection can guide surgical planning but is limited by invasive histopathological exams and interobserver variability. This study evaluates the diagnostic performance of artificial intelligence (AI) models, including machine learning (ML) and deep learning (DL), in predicting MVI in HCC using imaging modalities such as CT, MRI, ultrasound, PET/CT, and histopathology. Methods: A systematic review and meta-analysis followed PRISMA 2020 guidelines, covering studies from 2010 to 2023. Comprehensive searches were conducted across PubMed, Scopus, Web of Science, Embase, Cochrane Library, Google Scholar, European PMC, and BioMed Central. AI models were assessed for diagnostic accuracy using QUADAS-2 and Radiomics Quality Score (RQS). Metrics like the area under the curve (AUC), sensitivity, and specificity were analyzed. Results: This meta-analysis synthesized data from 51 studies, encompassing 6,257 records. DL models showed a pooled AUC of 0.84 (95% CI: 0.80–0.86), with sensitivity and specificity of 0.79 (95% CI: 0.75–0.82) and 0.84 (95% CI: 0.79–0.88), respectively. ML models achieved a pooled AUC of 0.83 (95% CI: 0.80–0.86), sensitivity of 0.79 (95% CI: 0.71–0.85), and higher specificity of 0.88 (95% CI: 0.84–0.92). Across imaging modalities, MRI and CT-based models achieved pooled AUCs of 0.87 and 0.83 for DL and 0.85 and 0.82 for ML, respectively. Ultrasound-based models demonstrated higher specificity but slightly lower sensitivity. Models incorporating clinical features did not outperform purely radiomics-based approaches. Quality assessments revealed low bias risks in patient selection (88%), index tests (94%), and reference standards (98%). 
However, only 51% of studies addressed inter-scanner variability, and 55% incorporated calibration or resampling. The mean RQS was 40%, with 84% adhering to robust imaging protocols. Conclusions: AI models, particularly DL, exhibit robust accuracy in predicting MVI in HCC, showing promise for integration into clinical workflows. These tools could enable personalized preoperative planning, improving patient outcomes and reducing recurrence risks. Standardized protocols, prospective validation, and broader adoption of advanced AI methods are needed to ensure consistent clinical utility and cost-effectiveness.

  • Research Article
  • 10.18502/fbt.v12i3.19190
Prognosis of COVID-19 Using Artificial Intelligence: A Systematic Review and Meta-Analysis
  • Jul 20, 2025
  • Frontiers in Biomedical Technologies
  • Saeed Reza Motamedian + 9 more

Purpose: Artificial Intelligence (AI) techniques have been extensively utilized for the diagnosis and prognosis of several diseases in recent years. This study identifies, appraises, and synthesizes published studies on the use of AI for the prognosis of COVID-19. Materials and Methods: An electronic search was performed using Medline, Google Scholar, Scopus, Embase, Cochrane, and ProQuest. The systematic approach followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines to ensure comprehensive reporting. Studies that examined machine learning or deep learning methods to determine the prognosis of COVID-19 using Computed Tomography (CT) or chest X-Ray (CXR) images were included. Pooled sensitivity, specificity, accuracy, Area Under the Curve (AUC), and diagnostic odds ratio were calculated. Results: A total of 36 articles were included; various prognosis-related issues, including disease severity, mechanical ventilation or admission to the intensive care unit, and mortality, were investigated. Several AI models and architectures were employed, such as the Siamese model, support vector machine, Random Forest, Extreme Gradient Boosting, and convolutional neural networks. The models achieved 71%, 88%, and 67% sensitivity for mortality, severity assessment, and need for ventilation, respectively. Specificities of 69%, 89%, and 89% were reported for the aforementioned variables. Conclusion: Based on the included articles, machine learning and deep learning methods used for COVID-19 patient prognosis using radiomic features from CT or CXR images can help clinicians manage patients and allocate resources more effectively. These studies also demonstrate that combining patient demographics, clinical data, laboratory tests, and radiomic features improves model performance.

  • Research Article
  • Citations: 7
  • 10.1016/j.jacr.2021.06.025
Real-World Surveillance of FDA-Cleared Artificial Intelligence Models: Rationale and Logistics.
  • Feb 1, 2022
  • Journal of the American College of Radiology
  • Keith J Dreyer + 2 more


  • Research Article
  • Citations: 14
  • 10.3233/thc-220501
Assessment of artificial intelligence-aided reading in the detection of nasal bone fractures.
  • May 12, 2023
  • Technology and Health Care
  • Cun Yang + 4 more

Artificial intelligence (AI) technology is a promising diagnostic adjunct in fracture detection. However, few studies describe the improvement in clinicians' diagnostic accuracy for nasal bone fractures with the aid of AI technology. This study aims to determine the value of an AI model in improving diagnostic accuracy for nasal bone fractures compared with manual reading. A total of 252 consecutive patients who had undergone facial computed tomography (CT) between January 2020 and January 2021 were enrolled in this study. The presence or absence of a nasal bone fracture was determined by two experienced radiologists. An AI algorithm based on deep learning was engineered, trained, and validated to detect fractures on CT images. Twenty readers with varying experience were invited to read the CT images with and without AI. Accuracy, sensitivity, and specificity with the aid of the AI model were calculated for each reader. The deep-learning AI model had 84.78% sensitivity, 86.67% specificity, a 0.857 area under the curve (AUC), and a 0.714 Youden index in identifying nasal bone fractures. For all readers, regardless of experience, AI-aided reading had higher sensitivity ([94.00 ± 3.17]% vs [83.52 ± 10.16]%, P < 0.001), specificity ([89.75 ± 6.15]% vs [77.55 ± 11.38]%, P < 0.001), and AUC (0.92 ± 0.04 vs 0.81 ± 0.10, P < 0.001) compared with reading without AI. With the aid of AI, sensitivity, specificity, and AUC were significantly improved in readers with 1-5 or 6-10 years of experience (all P < 0.05, Table 4). For readers with 11-15 years of experience, no evidence suggested that AI could improve sensitivity or AUC (P = 0.124 and 0.152, respectively). The AI model might aid less experienced physicians and radiologists in improving their diagnostic performance in localising nasal bone fractures on CT images.
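The Youden index quoted above is simply sensitivity plus specificity minus one; a one-line Python check using the reported model values confirms the figure.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic: 0 means no discriminative value, 1 is perfect."""
    return sensitivity + specificity - 1

# Reported AI-model sensitivity (84.78%) and specificity (86.67%).
j = youden_index(0.8478, 0.8667)  # ≈ 0.7145, matching the reported 0.714
```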

  • Supplementary Content
  • 10.7759/cureus.85884
Artificial Intelligence in Ultrasound-Based Diagnoses of Gynecological Tumors: A Systematic Review
  • Jun 12, 2025
  • Cureus
  • Fatima Siddig Abdalla Mohammed + 7 more

Gynecological tumors, particularly ovarian, endometrial, and uterine masses, pose significant diagnostic challenges due to their heterogeneity and the subjective nature of ultrasound interpretation. Artificial intelligence (AI) has emerged as a promising tool to enhance diagnostic accuracy, yet its clinical adoption remains limited. This systematic review synthesizes evidence on AI applications in ultrasound-based diagnosis of gynecological tumors, evaluating performance metrics, methodological strengths, and limitations to guide future research and clinical implementation. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines, a comprehensive search was conducted across PubMed, Excerpta Medica Database (Embase), Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), Scopus, and Web of Science, yielding 252 records. After removing duplicates and screening titles/abstracts, 106 studies were assessed, with 26 meeting inclusion criteria. Eligible studies investigated AI models for gynecological tumor diagnosis using ultrasound. Data were extracted on study design, sample size, AI methodology, performance metrics, and clinical applicability. Risk of bias was assessed using Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2). Narrative synthesis was performed due to methodological heterogeneity. The 26 included studies demonstrated strong diagnostic performance, with AI models achieving accuracies of 75-99.8% and areas under the curve (AUCs) of up to 0.99 in differentiating benign from malignant tumors. Deep learning architectures (e.g., convolutional neural networks (CNNs), residual neural networks (ResNet)) outperformed traditional machine learning in most studies, particularly when integrating radiomics with clinical variables (e.g., cancer antigen 125 (CA-125)). However, heterogeneity in imaging protocols, sample sizes, and validation methods limited comparability.
Only three studies employed prospective designs, and few addressed algorithmic bias or real-world clinical integration. AI shows significant potential to improve ultrasound-based diagnosis of gynecological tumors, offering superior accuracy and reproducibility compared to conventional methods. However, standardized imaging protocols, robust external validation, and prospective trials are needed to translate these tools into clinical practice. Future work should prioritize explainable AI, diverse datasets, and outcome studies to ensure equitable and effective implementation.

  • Research Article
  • Citations: 70
  • 10.1097/corr.0000000000001685
Can a Deep-learning Model for the Automated Detection of Vertebral Fractures Approach the Performance Level of Human Subspecialists?
  • Feb 26, 2021
  • Clinical orthopaedics and related research
  • Yi-Chu Li + 5 more

Vertebral fractures are the most common osteoporotic fractures in older individuals. Recent studies suggest that the performance of artificial intelligence is equal to humans in detecting osteoporotic fractures, such as fractures of the hip, distal radius, and proximal humerus. However, whether artificial intelligence performs as well in the detection of vertebral fractures on plain lateral spine radiographs has not yet been reported. (1) What is the accuracy, sensitivity, specificity, and interobserver reliability (kappa value) of an artificial intelligence model in detecting vertebral fractures, based on Genant fracture grades, using plain lateral spine radiographs compared with values obtained by human observers? (2) Do patients' clinical data, including the anatomic location of the fracture (thoracic or lumbar spine), T-score on dual-energy x-ray absorptiometry, or fracture grade severity, affect the performance of an artificial intelligence model? (3) How does the artificial intelligence model perform on external validation? Between 2016 and 2018, 1019 patients older than 60 years were treated for vertebral fractures in our institution. Seventy-eight patients were excluded because of missing CT or MRI scans (24% [19]), poor image quality in plain lateral radiographs of spines (54% [42]), multiple myeloma (5% [4]), and prior spine instrumentation (17% [13]). The plain lateral radiographs of 941 patients (one radiograph per person), with a mean age of 76 ± 12 years, and 1101 vertebral fractures between T7 and L5 were retrospectively evaluated for training (n = 565), validating (n = 188), and testing (n = 188) of an artificial intelligence deep-learning model. The gold standard for diagnosis (ground truth) of a vertebral fracture is the interpretation of the CT or MRI reports by a spine surgeon and a radiologist independently. 
If there were any disagreements between human observers, the corresponding CT or MRI images would be rechecked by them together to reach a consensus. For the Genant classification, the injured vertebral body height was measured in the anterior, middle, and posterior third. Fractures were classified as Grade 1 (< 25%), Grade 2 (26% to 40%), or Grade 3 (> 40%). The framework of the artificial intelligence deep-learning model included object detection, data preprocessing of radiographs, and classification to detect vertebral fractures. Approximately 90 seconds was needed to complete the procedure and obtain the artificial intelligence model results when applied clinically. The accuracy, sensitivity, specificity, interobserver reliability (kappa value), receiver operating characteristic curve, and area under the curve (AUC) were analyzed. The bootstrapping method was applied to our testing dataset and external validation dataset. The accuracy, sensitivity, and specificity were used to investigate whether fracture anatomic location or T-score in dual-energy x-ray absorptiometry report affected the performance of the artificial intelligence model. The receiver operating characteristic curve and AUC were used to investigate the relationship between the performance of the artificial intelligence model and fracture grade. External validation with a similar age population and plain lateral radiographs from another medical institute was also performed to investigate the performance of the artificial intelligence model. The artificial intelligence model with ensemble method demonstrated excellent accuracy (93% [773 of 830] of vertebrae), sensitivity (91% [129 of 141]), and specificity (93% [644 of 689]) for detecting vertebral fractures of the lumbar spine. 
The interobserver reliability (kappa value) of the artificial intelligence performance and human observers for thoracic and lumbar vertebrae were 0.72 (95% CI 0.65 to 0.80; p < 0.001) and 0.77 (95% CI 0.72 to 0.83; p < 0.001), respectively. The AUCs for Grades 1, 2, and 3 vertebral fractures were 0.919, 0.989, and 0.990, respectively. The artificial intelligence model with ensemble method demonstrated poorer performance for discriminating normal osteoporotic lumbar vertebrae, with a specificity of 91% (260 of 285) compared with nonosteoporotic lumbar vertebrae, with a specificity of 95% (222 of 234). There was a higher sensitivity 97% (60 of 62) for detecting osteoporotic (dual-energy x-ray absorptiometry T-score ≤ -2.5) lumbar vertebral fractures, implying easier detection, than for nonosteoporotic vertebral fractures (83% [39 of 47]). The artificial intelligence model also demonstrated better detection of lumbar vertebral fractures compared with detection of thoracic vertebral fractures based on the external dataset using various radiographic techniques. Based on the dataset for external validation, the overall accuracy, sensitivity, and specificity on bootstrapping method were 89%, 83%, and 95%, respectively. The artificial intelligence model detected vertebral fractures on plain lateral radiographs with high accuracy, sensitivity, and specificity, especially for osteoporotic lumbar vertebral fractures (Genant Grades 2 and 3). The rapid reporting of results using this artificial intelligence model may improve the efficiency of diagnosing vertebral fractures. The testing model is available at http://140.113.114.104/vght_demo/corr/. One or multiple plain lateral radiographs of the spine in the Digital Imaging and Communications in Medicine format can be uploaded to see the performance of the artificial intelligence model. Level II, diagnostic study.

  • Discussion
  • Citations: 1
  • 10.1097/cm9.0000000000002305
Detection of metastasis of mediastinal lymph nodes in lung cancer patients with an artificial intelligence model.
  • May 5, 2023
  • Chinese Medical Journal
  • Xiao Sun + 8 more

