Development of an AI system for accurately diagnosing hepatocellular carcinoma from computed tomography imaging data.

Meiyun Wang, Peiting You, Fangfang Fu, Yichen Yang, Xiaoyue Ma, Jianqiang Wu, Dalu Kong, Hongru Shen, Lin Sun, Yan Bai, Qiuyu Liu, Bingjie Zheng, Mingge Liu, Xiangchun Li, Fei Tian, Qingxia Wu

doi:10.1038/s41416-021-01511-w

Abstract

Computed tomography (CT) is frequently used to detect hepatocellular carcinoma (HCC) in routine clinical practice. The aim of this study was to develop a deep-learning AI system that improves the diagnostic accuracy of HCC by analysing liver CT imaging data.
We developed a deep-learning AI system by training on CT images from 7512 patients at Henan Provincial People's Hospital. Its performance was validated on one internal test set (Henan Provincial People's Hospital, n = 385) and one external test set (Henan Provincial Cancer Hospital, n = 556). The area under the receiver-operating characteristic curve (AUROC) was used as the primary classification metric. Accuracy, sensitivity, specificity, precision, negative predictive value and F1 score were used to measure the performance of the AI system and of radiologists.
The AI system achieved high performance in identifying HCC patients, with an AUROC of 0.887 (95% CI 0.855-0.919) on the internal test set and 0.883 (95% CI 0.855-0.911) on the external test set. On the internal test set, accuracy was 81.0% (76.8-84.8%), sensitivity was 78.4% (72.4-83.7%), specificity was 84.4% (78.0-89.6%) and F1 (the harmonic mean of precision and recall) was 0.824. On the external test set, accuracy was 81.3% (77.8-84.5%), sensitivity was 89.4% (85.0-92.8%), specificity was 74.0% (68.5-78.9%) and F1 was 0.819. Compared with radiologists, the AI system achieved comparable accuracy and F1 score on the internal test set (accuracy: 0.853 vs. 0.818, P = 0.107; F1: 0.863 vs. 0.824, P = 0.082) and on the external test set (accuracy: 0.805 vs. 0.793, P = 0.663; F1: 0.810 vs. 0.814, P = 0.866). Among HCC patients, the risk scores predicted by the AI system were higher in those with multiple tumours than in those with a solitary tumour (0.197 vs. 0.138, P = 0.006), and higher in those with a high fibrosis stage than in those with a low fibrosis stage (0.183 vs. 0.127, P < 0.001). Review by radiologists showed that the accuracy of the saliency heatmaps generated by the algorithm was 92.1% (95% CI: 89.2-95.0%).
The AI system achieved high performance in detecting HCC, comparable to that of a group of specialised radiologists. Prospective clinical trials are needed to validate this model.
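For readers less familiar with the metrics reported above, the following is a minimal sketch of how they are conventionally computed, not the authors' evaluation code. The names `Counts`, `classification_metrics` and `auroc` are hypothetical, and the counts and scores in the usage note are made-up illustrations, not data from this study. AUROC is computed here by direct pairwise comparison (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case), which is equivalent to the area under the ROC curve.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Counts:
    """Confusion-matrix counts for a binary classifier (hypothetical helper)."""
    tp: int  # HCC cases predicted as HCC
    fp: int  # non-HCC cases predicted as HCC
    tn: int  # non-HCC cases predicted as non-HCC
    fn: int  # HCC cases predicted as non-HCC


def classification_metrics(c: Counts) -> Dict[str, float]:
    """Return the threshold-based metrics named in the abstract.

    Assumes every denominator is non-zero (at least one positive case,
    one negative case, one positive prediction, one negative prediction).
    """
    sensitivity = c.tp / (c.tp + c.fn)   # also called recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)     # positive predictive value
    npv = c.tn / (c.tn + c.fn)           # negative predictive value
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": f1,
    }


def auroc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n*m) loop is for clarity only.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

As a usage illustration with invented numbers, `classification_metrics(Counts(tp=80, fp=10, tn=90, fn=20))` gives a sensitivity of 0.8, a specificity of 0.9 and an accuracy of 0.85, and `auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])` counts 7.5 correctly ranked pairs out of 9. Unlike the other metrics, AUROC does not depend on a decision threshold, which is one reason it is commonly chosen as a primary metric.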

Full Text