Abstract

Artificial intelligence (AI) has been successful at solving numerous problems in machine perception. In radiology, AI systems are rapidly evolving and show progress in guiding treatment decisions, diagnosing and localizing disease on medical images, and improving radiologists' efficiency. A critical component of deploying AI in radiology is gaining confidence in a developed system's efficacy and safety. The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset from one or more institutions, followed by a clinical validation study of the system's efficacy during deployment. Clinical validation studies are time-consuming, and best practices dictate limited re-use of analytical validation data, so it is ideal to know ahead of time whether a system is likely to fail analytical or clinical validation. In this paper, we describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons. We illustrate the sanity tests' value by designing a deep learning system to classify pancreatic cancer seen in computed tomography (CT) scans.
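One such sanity test can be run before any validation study: evaluate the trained classifier on copies of the scans in which the organ the label refers to has been masked out. The Python sketch below illustrates the idea under stated assumptions; it is not the paper's implementation, and names such as predict_proba, volumes, and roi_masks are hypothetical stand-ins.

import numpy as np
from sklearn.metrics import roc_auc_score

def masked_roi_auc(predict_proba, volumes, roi_masks, labels, fill_hu=-1000.0):
    """AUC on copies of the scans whose region of interest is replaced by air (-1000 HU)."""
    masked = []
    for vol, mask in zip(volumes, roi_masks):
        v = vol.copy()
        v[mask.astype(bool)] = fill_hu  # hypothetical masking: remove the organ the label describes
        masked.append(v)
    scores = predict_proba(np.stack(masked))  # assumed to return one probability per scan
    return roc_auc_score(labels, scores)

An AUC near 0.5 on the masked scans is reassuring; an AUC close to the unmasked performance suggests the model has learned confounding features rather than the disease itself.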

Highlights

  • Artificial intelligence (AI)-based computer-aided diagnostic (CAD) systems have the potential to help radiologists with a multitude of tasks, ranging from tumor classification to improved image reconstruction [1,2,3,4]

  • We expect a system trained on one format to perform best on test data processed in an identical manner, which is consistent with the self-test results along the diagonal of Figure 4 (a sketch of this cross-format check follows this list)

  • Its performance is significantly lower (P < 0.001) than that of systems trained on the original images with the pancreas (WP) and without the pancreas (WOP), which achieve 0.95 AUC and 0.97 AUC, respectively
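A minimal sketch of the cross-format check behind the self-test diagonal: train one model per preprocessing format, then evaluate every model on every format. This assumes scikit-learn-style classifiers; train_model and the datasets mapping are illustrative, not names from the paper.

import numpy as np
from sklearn.metrics import roc_auc_score

def cross_format_auc(train_model, datasets):
    """datasets maps a format name to (X_train, y_train, X_test, y_test).
    Returns the format names and an AUC matrix where rows index the
    training format and columns index the test format."""
    formats = list(datasets)
    auc = np.zeros((len(formats), len(formats)))
    for i, f_train in enumerate(formats):
        X_tr, y_tr, _, _ = datasets[f_train]
        model = train_model(X_tr, y_tr)  # assumed to return a fitted classifier
        for j, f_test in enumerate(formats):
            _, _, X_te, y_te = datasets[f_test]
            auc[i, j] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return formats, auc

If each row peaks on its diagonal entry, models are sensitive to their training format as expected; a row that is flat across formats hints that the model is keying on information that survives every preprocessing step.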



Introduction

Artificial intelligence (AI)-based computer-aided diagnostic (CAD) systems have the potential to help radiologists with a multitude of tasks, ranging from tumor classification to improved image reconstruction [1,2,3,4]. For AI-based software as a medical device, the gold standard for analytical validation is to assess performance on previously unseen independent datasets [9,10,11,12], followed by a clinical validation study. Both steps pose challenges for medical AI: it is difficult to collect large cohorts of high-quality, diverse medical imaging data acquired in a consistent manner [13,14]; both steps are time-consuming; and best practices dictate limited re-use of analytical validation data. The cost of failing the validation process could prohibit further development of particular applications.
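As a concrete illustration of the analytical-validation step described above, the snippet below reports AUC with a bootstrap confidence interval on a held-out external cohort. This is a generic sketch, and y_true and y_score are assumed to come from that unseen dataset rather than from any cohort described here.

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Point-estimate AUC and a (1 - alpha) bootstrap percentile interval."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # a resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_score), (lo, hi)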
