Development and validation of a deep learning model for multicategory pneumonia classification on chest computed tomography: a multicenter and multireader study.

Chunzi Shi, Fengxiang Song, Yi Zhan, Jie Shen, Yang Lu, Nannan Shi, Yuxin Shi, Fei Shan, Jili Wu, Ying Shao, Chuan Chen, Keying Wang, Xueni Huang, Yaozong Gao

doi:10.21037/qims-23-1097

Abstract

Accurate diagnosis of pneumonia is vital for effective disease management and mortality reduction, but it can be easily confused with other conditions on chest computed tomography (CT) due to an overlap in imaging features. We aimed to develop and validate a deep learning (DL) model based on chest CT for accurate classification of viral pneumonia (VP), bacterial pneumonia (BP), fungal pneumonia (FP), pulmonary tuberculosis (PTB), and no pneumonia (NP) conditions. In total, 1,776 cases from five hospitals in different regions were retrospectively collected from September 2019 to June 2023. All cases were enrolled according to inclusion and exclusion criteria, and ultimately 1,611 cases were used to develop the DL model with 5-fold cross-validation, with 165 cases being used as the external test set. Five radiologists blindly reviewed the images from the internal and external test sets first without and then with DL model assistance. Precision, recall, F1-score, weighted F1-average, and area under the curve (AUC) were used to evaluate the model performance. The F1-scores of the DL model on the internal and external test sets were, respectively, 0.947 [95% confidence interval (CI): 0.936-0.958] and 0.933 (95% CI: 0.916-0.950) for VP, 0.511 (95% CI: 0.487-0.536) and 0.591 (95% CI: 0.557-0.624) for BP, 0.842 (95% CI: 0.824-0.860) and 0.848 (95% CI: 0.824-0.873) for FP, 0.843 (95% CI: 0.826-0.861) and 0.795 (95% CI: 0.767-0.822) for PTB, and 0.975 (95% CI: 0.968-0.983) and 0.976 (95% CI: 0.965-0.986) for NP, with a weighted F1-average of 0.883 (95% CI: 0.867-0.898) and 0.846 (95% CI: 0.822-0.871), respectively. The model performed well and showed comparable performance in both the internal and external test sets. The F1-score of the DL model was higher than that of radiologists, and with DL model assistance, radiologists achieved a higher F1-score. 
On the external test set, the F1-score of the DL model for FP (0.848; 95% CI: 0.824-0.873) was higher than that of the radiologists (0.541; 95% CI: 0.507-0.575), as was its precision for the other three pneumonia conditions (all P values <0.001). With DL model assistance, the radiologists' F1-score for FP (0.778; 95% CI: 0.750-0.807) was higher than that achieved without assistance (0.541; 95% CI: 0.507-0.575), as was their precision for the other three pneumonia conditions (all P values <0.001). The DL approach can effectively classify pneumonia and can help improve radiologists' performance, supporting the full integration of DL results into the routine workflow of clinicians.
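
The evaluation metrics named above (precision, recall, F1-score, and weighted F1-average) can be illustrated with a minimal sketch. The abstract does not state how the weighted average was computed; the code below assumes the common convention of weighting each class's F1-score by its number of true cases (its support). The function names and the toy labels in the usage note are hypothetical and are not taken from the study.

```python
from collections import Counter

# The five categories classified in the study.
CLASSES = ["VP", "BP", "FP", "PTB", "NP"]


def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return precision, recall, F1 and support for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    metrics = {}
    for c in classes:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return metrics


def weighted_f1(metrics):
    """Support-weighted mean of per-class F1 (assumed weighting scheme)."""
    total = sum(m["support"] for m in metrics.values())
    if total == 0:
        return 0.0
    return sum(m["f1"] * m["support"] for m in metrics.values()) / total
```

For example, calling `per_class_metrics(y_true, y_pred)` on two equal-length lists of category labels and passing the result to `weighted_f1` yields a single summary score. The confidence intervals reported in the abstract would require an additional resampling or analytic step that is not shown here.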

Full Text