Use of deep learning model for paediatric elbow radiograph binomial classification: initial experience, performance and lessons learnt

Russ Yuezhi Chua,Pearlly Peiqi Chang,Marielle Valerie Fortier,Qiao Fan,Mark Bangwei Tan

doi:10.4103/singaporemedj.smj-2022-078

Abstract

Abstract Introduction: In this study, we aimed to compare the performance of a convolutional neural network (CNN)-based deep learning model that was trained on a dataset of normal and abnormal paediatric elbow radiographs with that of paediatric emergency department (ED) physicians on a binomial classification task. Methods: A total of 1,314 paediatric elbow lateral radiographs (patient mean age 8.2 years) were retrospectively retrieved and classified based on annotation as normal or abnormal (with pathology). They were then randomly partitioned to a development set (993 images); first and second tuning (validation) sets (109 and 100 images, respectively); and a test set (112 images). An artificial intelligence (AI) model was trained on the development set using the EfficientNet B1 network architecture. Its performance on the test set was compared to that of five physicians (inter-rater agreement: fair). Performance of the AI model and the physician group was tested using McNemar test. Results: The accuracy of the AI model on the test set was 80.4% (95% confidence interval [CI] 71.8%–87.3%), and the area under the receiver operating characteristic curve (AUROC) was 0.872 (95% CI 0.831–0.947). The performance of the AI model vs. the physician group on the test set was: sensitivity 79.0% (95% CI: 68.4%–89.5%) vs. 64.9% (95% CI: 52.5%–77.3%; P = 0.088); and specificity 81.8% (95% CI: 71.6%–92.0%) vs. 87.3% (95% CI: 78.5%–96.1%; P = 0.439). Conclusion: The AI model showed good AUROC values and higher sensitivity, with the P-value at nominal significance when compared to the clinician group.

Full Text