Radiology Resident Performance Research Articles

Background: Differentiating between benign and malignant vertebral fractures poses diagnostic challenges. Purpose: To investigate the reliability of CT-based deep learning models to differentiate between benign and malignant vertebral fractures. Materials and Methods: CT scans acquired in patients with benign or malignant vertebral fractures from June 2005 to December 2022 at two university hospitals were retrospectively identified based on a composite reference standard that included histopathologic and radiologic information. An internal test set was randomly selected, and an external test set was obtained from an additional hospital. Models used a three-dimensional U-Net encoder-classifier architecture and applied data augmentation during training. Performance was evaluated using the area under the receiver operating characteristic curve (AUC) and compared with that of two residents and one fellowship-trained radiologist using the DeLong test. Results: The training set included 381 patients (mean age, 69.9 years ± 11.4 [SD]; 193 male) with 1307 vertebrae (378 benign fractures, 447 malignant fractures, 482 malignant lesions). Internal and external test sets included 86 (mean age, 66.9 years ± 12; 45 male) and 65 (mean age, 68.8 years ± 12.5; 39 female) patients, respectively. The better-performing model of two training approaches achieved AUCs of 0.85 (95% CI: 0.77, 0.92) in the internal and 0.75 (95% CI: 0.64, 0.85) in the external test sets. Including an uncertainty category further improved performance to AUCs of 0.91 (95% CI: 0.83, 0.97) in the internal test set and 0.76 (95% CI: 0.64, 0.88) in the external test set. The AUC values of residents were lower than that of the best-performing model in the internal test set (AUC, 0.69 [95% CI: 0.59, 0.78] and 0.71 [95% CI: 0.61, 0.80]) and external test set (AUC, 0.70 [95% CI: 0.58, 0.80] and 0.71 [95% CI: 0.60, 0.82]), with significant differences only for the internal test set (P < .001). The AUCs of the fellowship-trained radiologist were similar to those of the best-performing model (internal test set, 0.86 [95% CI: 0.78, 0.93; P = .39]; external test set, 0.71 [95% CI: 0.60, 0.82; P = .46]). Conclusion: Developed models showed a high discriminatory power to differentiate between benign and malignant vertebral fractures, surpassing or matching the performance of radiology residents and matching that of a fellowship-trained radiologist. © RSNA, 2024. See also the editorial by Booz and D'Angelo in this issue.
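
The effect of an uncertainty category on AUC can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how uncertain cases were defined or handled, so the sketch assumes that vertebrae whose predicted malignancy probability falls inside a hypothetical band around 0.5 are set aside and the AUC is computed on the remaining cases. The function names, the band limits (0.4 to 0.6), and the toy data are all illustrative assumptions.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_excluding_uncertain(labels, scores, low=0.4, high=0.6):
    """AUC over cases outside the assumed uncertainty band [low, high].

    Returns the AUC and the number of cases that were set aside.
    """
    kept = [(y, s) for y, s in zip(labels, scores) if not (low <= s <= high)]
    kept_labels = [y for y, _ in kept]
    kept_scores = [s for _, s in kept]
    return auc(kept_labels, kept_scores), len(labels) - len(kept)


# Toy data (made up): 1 = malignant, 0 = benign.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.25, 0.55, 0.45, 0.30, 0.70, 0.85, 0.95]

print(auc(labels, scores))                      # 0.9375
print(auc_excluding_uncertain(labels, scores))  # (1.0, 2)
```

In the toy data the two misranked cases both sit inside the band, so removing them raises the AUC. The trade-off is that those cases receive no automated call, which is why the number of excluded cases is returned alongside the AUC.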


Chest radiography is the most common diagnostic imaging examination performed in emergency departments (EDs). Augmenting clinicians with automated preliminary read assistants could help expedite their workflows, improve accuracy, and reduce the cost of care. This study assessed the performance of artificial intelligence (AI) algorithms in realistic radiology workflows through an objective comparative evaluation of the preliminary reads of anteroposterior (AP) frontal chest radiographs performed by an AI algorithm and radiology residents. This diagnostic study included a set of 72 findings assembled by clinical experts to constitute a full-fledged preliminary read of AP frontal chest radiographs. A novel deep learning architecture was designed for an AI algorithm to estimate the findings per image. The AI algorithm was trained using a multihospital training data set of 342 126 frontal chest radiographs captured in ED and urgent care settings. The training data were labeled from their associated reports. Image-based F1 score was chosen to optimize the operating point on the receiver operating characteristic (ROC) curve so as to minimize the number of missed findings and overcalls per image read. The performance of the model was compared with that of 5 radiology residents recruited from multiple institutions in the US in an objective study in which a separate data set of 1998 AP frontal chest radiographs was drawn from a hospital source representative of realistic preliminary reads in inpatient and ED settings. A triple consensus with adjudication process was used to derive the ground truth labels for the study data set. The performance of the AI algorithm and radiology residents was assessed by comparing their reads with ground truth findings. All studies were conducted through a web-based clinical study application system. The triple consensus data set was collected between February and October 2018. The comparison study was performed between January and October 2019. Data were analyzed from October 2019 to February 2020. After the first round of reviews, further analysis of the data was performed from March to July 2020. The learning performance of the AI algorithm was judged using the conventional ROC curve and the area under the curve (AUC) during training and field testing on the study data set. For the AI algorithm and radiology residents, the individual finding label performance was measured using the conventional measures of label-based sensitivity, specificity, and positive predictive value (PPV). In addition, the agreement with the ground truth on the assignment of findings to images was measured using the pooled κ statistic. The preliminary read performance was recorded for the AI algorithm and radiology residents using new measures of mean image-based sensitivity, specificity, and PPV designed for recording the fraction of misses and overcalls on a per image basis. The 1-sided analysis of variance test was used to compare the means of each group (AI algorithm vs radiology residents) using the F distribution, and the null hypothesis was that the groups would have similar means. The trained AI algorithm achieved a mean AUC across labels of 0.807 (weighted mean AUC, 0.841) after training. On the study data set, which had a different prevalence distribution, the mean AUC achieved was 0.772 (weighted mean AUC, 0.865). The interrater agreement with ground truth finding labels for AI algorithm predictions had a pooled κ value of 0.544, and the pooled κ for radiology residents was 0.585. For the preliminary read performance, the analysis of variance test was used to compare the distributions of the AI algorithm's and radiology residents' mean image-based sensitivity, PPV, and specificity. The mean image-based sensitivity for the AI algorithm was 0.716 (95% CI, 0.704-0.729) and for radiology residents was 0.720 (95% CI, 0.709-0.732) (P = .66), while the PPV was 0.730 (95% CI, 0.718-0.742) for the AI algorithm and 0.682 (95% CI, 0.670-0.694) for the radiology residents (P < .001), and specificity was 0.980 (95% CI, 0.980-0.981) for the AI algorithm and 0.973 (95% CI, 0.971-0.974) for the radiology residents (P < .001). These findings suggest that it is possible to build AI algorithms that reach and exceed the mean level of performance of third-year radiology residents for full-fledged preliminary read of AP frontal chest radiographs. This diagnostic study also found that while the more complex findings would still benefit from expert overreads, the performance of AI algorithms was associated with the amount of data available for training rather than the level of difficulty of interpretation of the finding. Integrating such AI systems in radiology workflows for preliminary interpretations has the potential to expedite existing radiology workflows and address resource scarcity while improving overall accuracy and reducing the cost of care.
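
The idea of image-based measures, which score misses and overcalls per image and then average over images, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's definition: it assumes each image has a set of ground-truth findings and a set of predicted findings drawn from a fixed vocabulary, computes sensitivity, PPV, and specificity per image, and averages each measure over the images where its denominator is nonzero. The function name `image_based_metrics`, the handling of empty denominators, and the toy data are hypothetical.

```python
def image_based_metrics(truth, pred, n_labels):
    """Mean per-image sensitivity, PPV, and specificity.

    truth, pred: lists of sets of finding ids, one set per image.
    n_labels: size of the finding vocabulary (72 in the study; smaller here).
    Images with an undefined measure (zero denominator) are skipped for
    that measure only; this is an assumption of the sketch.
    """
    sens, ppv, spec = [], [], []
    for t, p in zip(truth, pred):
        tp = len(t & p)   # findings correctly reported
        fp = len(p - t)   # overcalls
        fn = len(t - p)   # misses
        tn = n_labels - tp - fp - fn
        if tp + fn:
            sens.append(tp / (tp + fn))
        if tp + fp:
            ppv.append(tp / (tp + fp))
        if tn + fp:
            spec.append(tn / (tn + fp))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(sens), mean(ppv), mean(spec)


# Toy data (made up): 3 images, vocabulary of 10 findings.
truth = [{0, 1}, {2}, set()]
pred = [{0}, {2, 3}, set()]

sensitivity, ppv, specificity = image_based_metrics(truth, pred, n_labels=10)
print(sensitivity, ppv, specificity)
```

Here the first image has one miss, the second has one overcall, and the third is a correctly read normal image that contributes only to specificity. With a large vocabulary and few findings per image, true negatives dominate, which is consistent with the specificities near 0.98 reported for both the AI algorithm and the residents.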


Articles published on Radiology Resident Performance

Training on contrast-enhanced ultrasound LI-RADS classification for resident radiologists: a retrospective comparison of performance after training

Improving the diagnostic performance of inexperienced readers for thyroid nodules through digital self-learning and artificial intelligence assistance.

Deep Learning to Differentiate Benign and Malignant Vertebral Fractures at Multidetector CT.

A Visual Aid Tool for Detection of Pancreatic Tumour-Vessel Contact on Staging CT: A Retrospective Cohort Study.

Impact of deep learning on radiologists and radiology residents in detecting breast cancer on CT: a cross-vendor test study

The moderating role of resilience in the association between workload and depressive symptoms among radiology residents in China: results from a nationwide cross-sectional study.

Structured report improves radiology residents' performance in reporting chest high-resolution computed tomography: a study in patients with connective tissue disease.

Interpretation of computed tomography of the cervical spine by non-radiologists: a systematic review and meta-analysis.

Resident-attending discrepancy rates for two consecutive versus nonconsecutive weeks of overnight shifts.

Evaluation of neuroradiology emergency MRI interpretations: low discrepancy rates between on-call radiology residents' preliminary interpretations and neuroradiologists' final reports.

Diagnostic Performance of Radiology Residents in Thoracic CT Imaging in Emergency Radiology During The COVID-19 Pandemic

Radiology resident selection factors predict resident performance

Brain MRI Deep Learning and Bayesian Inference System Augments Radiology Resident Performance.

Comparison of Chest Radiograph Interpretations by Artificial Intelligence Algorithm vs Radiology Residents

USMLE Step 3 Scores Have Value in Predicting ABR Core Examination Outcome and Performance: A Multi-institutional Study

Impact of Simulation Training on Radiology Resident Performance in Neonatal Head Ultrasound

The Relationship Between US Medical Licensing Examination Step Scores and ABR Core Examination Outcome and Performance: A Multi-institutional Study.

Performance of On-Call Radiology Residents in Interpreting Total Spine MRI Studies for the Detection of Spinal Cord Compression or Cauda Equina Compression.

Training with simulated lung nodules in X-rays can improve the localization performance of radiology residents

Assessing Competence in Emergency Radiology Using an Online Simulator
