Cross-Attention Based Multi-Scale Feature Fusion Vision Transformer For Breast Ultrasound Image Classification

Lele Li, Meng Wu, Lang Wang, Yu Jin, Peng Jiang, Jing Feng, Ziling Wu, Juan Liu

doi:10.1109/bibm55620.2022.9994966

Abstract

Breast cancer has become one of the most common cancers in the world, and it is also the most lethal cancer in women. As a non-invasive imaging modality, ultrasonography can be used to diagnose the severity of breast lesions and for large-scale screening. However, because the lesions in breast ultrasound (BUS) images are morphologically diverse, with relatively low contrast and complex textures, BUS image recognition is more challenging than natural image recognition. In this study, we propose a novel network architecture that combines a convolutional neural network (CNN) with a vision transformer (ViT) to aggregate local feature details and long-range feature dependencies. Moreover, to perform multi-scale feature fusion, we introduce cross-attention between the deep feature map and the shallow feature map within each network block, so that deep and shallow feature information can interact. To verify the effectiveness of the model, we constructed a large-scale dataset and conducted extensive experiments. The results show that our method achieves an accuracy of 85.33% and, at comparable parameter complexity, outperforms most convolutional neural networks (CNNs) and vision transformers (ViTs).
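
The cross-attention fusion described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head, identity query/key/value projections, a residual connection, and that deep-feature tokens act as queries while shallow-feature tokens act as keys and values; the abstract does not specify any of these details. The names `cross_attention`, `deep_tokens`, and `shallow_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(deep_tokens, shallow_tokens):
    """Fuse two token sets with scaled dot-product cross-attention.

    Assumptions (not taken from the paper): one head, identity projections,
    deep tokens as queries, shallow tokens as keys and values, and a
    residual connection that adds the attended result back to each query.
    """
    dim = len(deep_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in deep_tokens:
        # Similarity of this deep token to every shallow token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in shallow_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the shallow tokens (the values).
        attended = [
            sum(w * value[j] for w, value in zip(weights, shallow_tokens))
            for j in range(dim)
        ]
        # Residual connection keeps the original deep feature.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy example: 2 deep tokens attend over 4 shallow tokens, all of width 2.
deep = [[1.0, 0.0], [0.0, 1.0]]
shallow = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fused = cross_attention(deep, shallow)
```

The output has one fused vector per deep token, so the deep branch keeps its resolution while absorbing detail from the shallow branch. A practical model would use learned projections and multiple heads, and might also attend in the opposite direction.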

Full Text