Independent evaluation of a multi-view multi-task convolutional neural network breast cancer classification model using Finnish mammography screening data

A Isosalo, S.I Inkinen, T Turunen, P.S Ipatti, J Reponen, M.T Nieminen

doi:10.1016/j.compbiomed.2023.107023


Open Access

https://doi.org/10.1016/j.compbiomed.2023.107023


Abstract

Background: Development of deep convolutional neural networks for breast cancer classification has taken significant steps towards clinical adoption. It remains unclear, however, how the models perform on unseen data and what is required to adapt them to different demographic populations. In this retrospective study, we adopt an openly available pre-trained multi-view mammography breast cancer classification model and evaluate it using an independent Finnish dataset.

Methods: Transfer learning was used, and the pre-trained model was finetuned with 8,829 examinations from the Finnish dataset (4,321 normal, 362 malignant and 4,146 benign examinations). A holdout dataset of 2,208 examinations from the Finnish dataset (1,082 normal, 70 malignant and 1,056 benign examinations) was used in the evaluation. Performance was also evaluated on a manually annotated malignant suspect subset. Receiver Operating Characteristic (ROC) and Precision–Recall curves were used as performance measures.

Results: The Area Under ROC [95% CI] values for malignancy classification obtained with the finetuned model on the entire holdout set were 0.82 [0.76, 0.87], 0.84 [0.77, 0.89], 0.85 [0.79, 0.90], and 0.83 [0.76, 0.89] for the R-MLO, L-MLO, R-CC and L-CC views, respectively. Performance on the malignant suspect subset was slightly better. Performance on the auxiliary benign classification task remained low.

Conclusions: The results indicate that the model also performs well in an out-of-distribution setting. Finetuning allowed the model to adapt to some of the underlying local demographics. Future research should concentrate on identifying breast cancer subgroups that adversely affect performance, as this is a requirement for increasing the model's readiness level for a clinical setting.
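The abstract reports per-view Area Under ROC values with 95% confidence intervals but does not state how the intervals were computed. As an illustration only, the following minimal sketch shows one common way to obtain such an interval: a percentile bootstrap over examinations. The function names `roc_auc` and `bootstrap_auc_ci`, the resample count, and the seed are assumptions made for this example and are not taken from the study.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC.

    Hypothetical helper: resamples examinations with replacement and
    takes the alpha/2 and 1 - alpha/2 percentiles of the resampled AUCs.
    """
    point = roc_auc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

In such a setup, `labels` would hold 1 for malignant and 0 for non-malignant examinations and `scores` the model's malignancy probability for one view, so the function would be called once per view (R-MLO, L-MLO, R-CC, L-CC). With only 70 malignant examinations in the holdout set, intervals of the width reported above are to be expected.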
