Abstract

Computer-aided ultrasound (US) imaging is an important prerequisite for early clinical diagnosis and treatment. Owing to the poor quality of US images and the blurry tumor regions, recent memory-based video object segmentation (VOS) models achieve frame-level segmentation by performing intensive similarity matching across past frames, which inevitably introduces computational redundancy. Furthermore, the attention mechanism used in these models allocates the same level of attention to all spatial-temporal memory features without distinction, which may degrade accuracy. In this paper, we first build a larger annotated benchmark dataset for breast lesion segmentation in ultrasound videos, and then propose a lightweight clip-level VOS framework that achieves higher segmentation accuracy while maintaining speed. We propose the Inner-Outer Clip Retformer to extract spatial-temporal tumor features in parallel. Specifically, the Outer Clip Retformer extracts tumor movement features from past video clips to locate the tumor in the current clip, while the Inner Clip Retformer extracts fine-grained features of the current tumor to produce more accurate segmentation results. A Clip Contrastive loss is further proposed to align the extracted tumor features along both the spatial and temporal dimensions, improving segmentation accuracy. In addition, we propose a Global Retentive Memory to maintain complementary tumor features with fewer computing resources and to generate coherent temporal movement features. In this way, our model significantly improves spatial-temporal perception without a large increase in parameters, achieving more accurate segmentation results while maintaining a fast segmentation speed. Finally, we conduct extensive experiments on several video object segmentation datasets; the results show that our framework outperforms state-of-the-art segmentation methods.
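
To make the clip-level alignment idea concrete, the sketch below shows one way a clip contrastive loss could be formulated: pooled tumor embeddings from corresponding clips of the same video are treated as positive pairs, while embeddings from other clips in the batch serve as negatives. This is only an illustrative assumption based on the abstract; the function name `clip_contrastive_loss`, the InfoNCE-style formulation, and the temperature parameter are ours, not necessarily the paper's.

```python
# Illustrative sketch (assumption): an InfoNCE-style clip contrastive loss that
# pulls together tumor embeddings from adjacent clips of the same lesion and
# pushes apart embeddings from other clips. The paper's exact loss may differ.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(curr_feats: torch.Tensor,
                          past_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """curr_feats, past_feats: (N, C) clip-level tumor embeddings, where row i
    of both tensors comes from the same lesion/video (a positive pair)."""
    curr = F.normalize(curr_feats, dim=1)   # work in cosine-similarity space
    past = F.normalize(past_feats, dim=1)
    logits = curr @ past.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(curr.size(0), device=curr.device)
    # Diagonal entries are positives; off-diagonal entries act as negatives.
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage: 4 clips with 256-dim pooled tumor embeddings.
    curr = torch.randn(4, 256)
    past = torch.randn(4, 256)
    print(clip_contrastive_loss(curr, past).item())
```

Under this reading, minimizing the loss encourages the current-clip and past-clip tumor features to agree, which is one plausible way to realize the spatial-temporal alignment the abstract describes.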
