Multi-modal medical images are important for tumor lesion detection. However, existing detection models use only a single modality to detect lesions; they do not sufficiently consider the semantic correlation between modalities and lack the ability to express the shape, size, and contrast features of lesions. To address this, a Cross Modal YOLOv5 model (CMYOLOv5) is proposed. Firstly, the model comprises two networks: an auxiliary network with a dual-branch structure that extracts semantic information from PET and CT, and a backbone network based on YOLOv5 that extracts semantic information from PET/CT. Secondly, Cross-modal Features Fusion (CFF) is designed in the auxiliary network to fuse PET functional information with CT anatomical information, and Self-Adaptive Attention Fusion (AAF) is designed in the backbone network to fuse and enhance the complementary information of the three modalities. Thirdly, a Self-Adaptive Transformer (SAT) is designed in the feature enhancement neck: a Transformer with a deformable attention mechanism focuses on the lung tumor region, and an MLP with a channel attention mechanism enhances the feature representation of that region. Finally, a Reparameter Residual Block (RRB) and a Reparameter Convolution operation (RC) are designed to learn richer PET, CT, and PET/CT features. Comparative experiments are conducted on a clinical multi-modal lung tumor PET/CT dataset, and the effectiveness of CMYOLOv5 is verified in terms of Precision, Recall, mAP, F1, FPS, and training time, with results of 97.16%, 96.41%, 97.18%, 96.78%, 96.37, and 3912 s, respectively. CMYOLOv5 achieves high precision in detecting irregular lung tumors and outperforms existing advanced methods.
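As a quick consistency check on the reported metrics, the sketch below recomputes F1 from the stated Precision and Recall. It assumes the standard definition of F1 as the harmonic mean of the two, which the abstract does not spell out; the function name `f1_score` is illustrative only and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
precision, recall = 97.16, 96.41
print(round(f1_score(precision, recall), 2))  # prints 96.78
```

Under that assumption, the recomputed value agrees with the reported F1 of 96.78%.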