Virtual clinical trial for task-based evaluation of a deep learning synthetic mammography algorithm

Andreu Badal,Aldo Badano,Rongping Zeng,Sarah Divel,Kenny H Cha,Christian G Graff

doi:10.1117/12.2513062

Abstract

Image processing algorithms based on deep learning techniques are being developed for a wide range of medical applications. Processed medical images are typically evaluated with the same kind of image similarity metrics used for natural scenes, disregarding the medical task for which the images are intended. We propose a com- putational framework to estimate the clinical performance of image processing algorithms using virtual clinical trials. The proposed framework may provide an alternative method for regulatory evaluation of non-linear image processing algorithms. To illustrate this application of virtual clinical trials, we evaluated three algorithms to compute synthetic mammograms from digital breast tomosynthesis (DBT) scans based on convolutional neural networks previously used for denoising low dose computed tomography scans. The inputs to the networks were one or more noisy DBT projections, and the networks were trained to minimize the difference between the output and the corresponding high dose mammogram. DBT and mammography images simulated with the Monte Carlo code MC-GPU using realistic breast phantoms were used for network training and validation. The denoising algorithms were tested in a virtual clinical trial by generating 3000 synthetic mammograms from the public VICTRE dataset of simulated DBT scans. The detectability of a calcification cluster and a spiculated mass present in the images was calculated using an ensemble of 30 computational channelized Hotelling observers. The signal detectability results, which took into account anatomic and image reader variability, showed that the visibility of the mass was not affected by the post-processing algorithm, but that the resulting slight blurring of the images severely impacted the visibility of the calcification cluster. The evaluation of the algorithms using the pixel-based metrics peak signal to noise ratio and structural similarity in image patches was not able to predict the reduction in performance in the detectability of calcifications. These two metrics are computed over the whole image and do not consider any particular task, and might not be adequate to estimate the diagnostic performance of the post-processed images.

Full Text