Abstract

Importance

Proper evaluation of the performance of artificial intelligence techniques in the analysis of digitized medical images is paramount for the adoption of such techniques by the medical community and regulatory agencies.

Objectives

To compare several cross-validation (CV) approaches for evaluating the performance of a classifier for automatic grading of prostate cancer in digitized histopathologic images, and to compare the performance of the classifier when trained using data from 1 expert vs multiple experts.

Design, Setting, and Participants

This quality improvement study used tissue microarray data (333 cores) from 231 patients who underwent radical prostatectomy at the Vancouver General Hospital between June 27, 1997, and June 7, 2011. Digitized images of tissue cores were annotated by 6 pathologists for 4 classes (benign and Gleason grades 3, 4, and 5) between December 12, 2016, and October 5, 2017. Nonoverlapping patches of 192 µm² were extracted from these images. A deep learning classifier based on convolutional neural networks was trained to predict one of the 4 class labels (benign and Gleason grades 3, 4, and 5) for each image patch. Classification performance was evaluated with 20-fold leave-patches-out, leave-cores-out, and leave-patients-out CV. The analysis was performed between November 15, 2018, and January 1, 2019.

Main Outcomes and Measures

Classifier performance was evaluated by accuracy, sensitivity, and specificity in cancer detection (benign vs cancer) and in low-grade vs high-grade differentiation (Gleason grade 3 vs grades 4-5). Statistical significance was assessed using the McNemar test. The level of agreement between pathologists and the classifier was quantified using a quadratic-weighted κ statistic.

Results

On 333 tissue microarray cores from 231 participants with prostate cancer (mean [SD] age, 63.2 [6.3] years), 20-fold leave-patches-out CV resulted in mean (SD) accuracy of 97.8% (1.2%), sensitivity of 98.5% (1.0%), and specificity of 97.5% (1.2%) for classifying benign vs cancerous patches. By contrast, 20-fold leave-patients-out CV resulted in mean (SD) accuracy of 85.8% (4.3%), sensitivity of 86.3% (4.1%), and specificity of 85.5% (7.2%). Similarly, 20-fold leave-cores-out CV resulted in mean (SD) accuracy of 86.7% (3.7%), sensitivity of 87.2% (4.0%), and specificity of 87.7% (5.5%). McNemar tests showed that leave-patches-out CV accuracy, sensitivity, and specificity were significantly higher than those of both leave-patients-out and leave-cores-out CV. Similar results were observed for classifying low-grade vs high-grade cancer. When trained on a single expert, the overall agreement in grading between pathologists and the classifier ranged from 0.38 to 0.58; when trained using the majority vote among all experts, it was 0.60.

Conclusions and Relevance

Results of this study suggest that, in prostate cancer classification from histopathologic images, patch-wise CV and single-expert training and evaluation may lead to a biased estimate of a classifier's performance. To allow reproducibility and facilitate comparison between automatic classification methods, studies in the field should evaluate performance using patient-based CV and multiexpert data. Some of these conclusions may generalize to other histopathologic applications and to other applications of machine learning in medicine.
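The three CV strategies differ only in how image patches are grouped when the folds are formed. The sketch below illustrates the distinction with scikit-learn's KFold and GroupKFold; the data and variable names are illustrative placeholders, not the study's code.

    # Patient-level vs patch-level cross-validation splits (illustrative).
    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold

    rng = np.random.default_rng(0)
    n_patches = 1000
    X = rng.normal(size=(n_patches, 16))      # stand-in patch features
    y = rng.integers(0, 4, size=n_patches)    # 4 classes: benign, G3, G4, G5
    patient_ids = rng.integers(0, 231, size=n_patches)

    # Leave-patches-out: patches from the same patient may fall in both the
    # training and test folds, leaking patient-specific appearance.
    for train_idx, test_idx in KFold(n_splits=20, shuffle=True,
                                     random_state=0).split(X):
        pass  # train and evaluate here

    # Leave-patients-out: GroupKFold keeps all of a patient's patches in a
    # single fold, so test patients are never seen during training.
    for train_idx, test_idx in GroupKFold(n_splits=20).split(X, y,
                                                             groups=patient_ids):
        assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])

Leave-cores-out follows the same pattern with core identifiers as the groups. The agreement statistic reported above corresponds to sklearn.metrics.cohen_kappa_score(y_true, y_pred, weights='quadratic'), and a paired McNemar test is available in statsmodels (statsmodels.stats.contingency_tables.mcnemar).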

Highlights

  • In the last decade, the literature on medical imaging in general, and on digital pathology in particular, has seen a dramatic increase in articles involving artificial intelligence and machine learning for automatic image analysis and classification,[1,2] as part of the development of computer-aided diagnosis systems to increase accuracy, reproducibility, and efficient throughput

  • We demonstrated that some recently published studies on prostate cancer classification from histopathologic images have followed flawed experimental designs

  • We showed that training on data provided by a single expert can lead to biased results that generalize poorly compared with a model trained on data from multiple experts; a sketch of multiexpert label aggregation follows this list
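When labels from several pathologists are available, one common way to combine them, and the one named in the abstract, is a per-patch majority vote. A minimal sketch, assuming integer class labels and an illustrative tie-breaking rule (not taken from the paper):

    # Consensus patch labels by majority vote across experts (illustrative).
    import numpy as np

    # Shape (n_experts, n_patches); 0 = benign, 1 = G3, 2 = G4, 3 = G5.
    annotations = np.array([
        [0, 1, 2, 2, 3],   # pathologist A
        [0, 1, 2, 3, 3],   # pathologist B
        [0, 1, 1, 2, 3],   # pathologist C
    ])

    def majority_vote(labels):
        """Most frequent label per patch; ties resolve to the lower grade."""
        n_classes = labels.max() + 1
        counts = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
        return counts.argmax(axis=0)

    consensus = majority_vote(annotations)   # -> array([0, 1, 2, 2, 3])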

Introduction

The literature on medical imaging in general, and on digital pathology in particular, has seen a dramatic increase in articles involving artificial intelligence and machine learning for automatic image analysis and classification,[1,2] as part of the development of computer-aided diagnosis systems to increase accuracy, reproducibility, and throughput. This trend has been enabled by increased computational power, improved image processing and machine learning algorithms, and the availability of more comprehensive data sets for training and evaluation. This work focuses on classifiers, as we target the problem of classifying histopathologic images into several classes, such as benign, low-grade cancer, and high-grade cancer.
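As a concrete illustration of the kind of classifier targeted here, the sketch below shows a small convolutional network for 4-class patch classification in PyTorch. The architecture, layer sizes, and input resolution are assumptions chosen for brevity; they do not reproduce the network used in the study.

    # Minimal 4-class CNN patch classifier (illustrative architecture).
    import torch
    import torch.nn as nn

    class PatchClassifier(nn.Module):
        def __init__(self, n_classes=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(64, n_classes),
            )

        def forward(self, x):
            return self.head(self.features(x))

    # One forward pass on a batch of RGB patches -> per-class logits.
    logits = PatchClassifier()(torch.randn(8, 3, 96, 96))   # shape (8, 4)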
