Abstract

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content.

Purpose: To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance.

Materials and Methods: In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the Radiological Society of North America (RSNA) validation set (n = 1425 pediatric hand radiographs; internal test set) and the Digital Hand Atlas (DHA; n = 1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational "stress tests" were performed by comparing the model's predictions on baseline and transformed images. Mean absolute differences (MAD) between predicted bone ages and radiologist-determined ground truth were compared between baseline and transformed images using Wilcoxon signed rank tests. The proportions of clinically significant errors (CSEs) were compared using McNemar tests.

Results: There was no evidence of a difference in MAD of the model on the two baseline test sets (RSNA = 6.8, DHA = 6.9; P = .05), indicating good model generalization to external data. Except for the RSNA images with an appended radiologic laterality marker (P = .86), there were significant differences in MAD for both the DHA and RSNA datasets across the other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in the proportion of CSEs for 57.6% (19/33) of the image transformations performed on the DHA dataset.

Conclusion: Although an award-winning pediatric bone age DL model generalized well to curated external images, it had inconsistent predictions on images that had undergone simple transformations reflective of several real-world variations in image appearance.

©RSNA, 2024.
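
The stress-test procedure described in Materials and Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the image format (nested lists of 0 to 255 intensities), the `toy_model` stand-in for the bone age model, and the `threshold_months` cutoff for a clinically significant error are all assumptions made for illustration. The sketch covers the MAD comparison and an exact two-sided McNemar test on paired CSE outcomes; it omits the Wilcoxon signed rank test.

```python
from math import comb
from statistics import mean


def flip_horizontal(img):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in img]


def invert(img):
    """Invert intensities, assuming an 8-bit 0..255 range."""
    return [[255 - p for p in row] for row in img]


def adjust_brightness(img, delta):
    """Add delta to every pixel, clipped to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def mean_absolute_difference(predicted, truth):
    """MAD between predicted and ground-truth bone ages."""
    return mean(abs(p - t) for p, t in zip(predicted, truth))


def clinically_significant(predicted, truth, threshold_months):
    """Flag predictions whose absolute error reaches the (assumed) threshold."""
    return [abs(p - t) >= threshold_months for p, t in zip(predicted, truth)]


def mcnemar_exact_p(errors_a, errors_b):
    """Two-sided exact McNemar test on paired boolean outcomes."""
    b = sum(1 for x, y in zip(errors_a, errors_b) if x and not y)
    c = sum(1 for x, y in zip(errors_a, errors_b) if y and not x)
    n = b + c  # discordant pairs only
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def toy_model(img):
    """Hypothetical stand-in for the DL model: mean intensity as 'months'."""
    return mean(p for row in img for p in row)


def stress_test(model, images, truth, transform, threshold_months):
    """Compare a model on baseline versus transformed copies of the same images."""
    baseline = [model(im) for im in images]
    transformed = [model(transform(im)) for im in images]
    cse_base = clinically_significant(baseline, truth, threshold_months)
    cse_trans = clinically_significant(transformed, truth, threshold_months)
    return {
        "mad_baseline": mean_absolute_difference(baseline, truth),
        "mad_transformed": mean_absolute_difference(transformed, truth),
        "cse_baseline": sum(cse_base) / len(cse_base),
        "cse_transformed": sum(cse_trans) / len(cse_trans),
        "mcnemar_p": mcnemar_exact_p(cse_base, cse_trans),
    }


if __name__ == "__main__":
    images = [[[100, 100]], [[50, 150]]]  # two tiny illustrative "images"
    truth = [100, 110]                    # illustrative ground truth in months
    print(stress_test(toy_model, images, truth, invert, threshold_months=24))
```

Because McNemar's test uses only the discordant pairs (images where the model makes a CSE on one version but not the other), it suits this paired design, in which every transformed image has a matching baseline image.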
