Abstract

This paper investigates how to effectively mine contextual information across sequential images and model it jointly in medical imaging tasks. Unlike state-of-the-art methods that model sequential correlations via point-wise token encoding, we develop a novel hierarchical pattern-aware tokenization strategy. It handles distinct visual patterns independently and hierarchically, which not only ensures full flexibility of attention aggregation across different pattern representations but also preserves local and global information simultaneously. Building on this strategy, we propose the Pattern-Aware Transformer (PATrans), which features a global-local dual-path pattern-aware cross-attention mechanism for hierarchical pattern matching and propagation among sequential images. PATrans is plug-and-play and can be seamlessly integrated into various backbone networks for diverse downstream sequence modeling tasks. We demonstrate its general application paradigm across four domains and five benchmarks spanning video object detection and 3D volumetric semantic segmentation. PATrans sets a new state of the art on all five benchmarks: CVC-Video (92.3% detection F1), ASU-Mayo (99.1% localization F1), Lung Tumor (78.59% DSC), Nasopharynx Tumor (75.50% DSC), and Kidney Tumor (87.53% DSC). Code and models are available at https://github.com/GGaoxiang/PATrans.
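
To make the dual-path mechanism concrete, the sketch below shows one way a plug-and-play global-local cross-attention block between a target frame and a neighboring support frame could be wired up in PyTorch. This is a minimal illustration under our own assumptions: the class name DualPathCrossAttention, the banded local mask, and the linear fusion are hypothetical stand-ins, not the authors' implementation (which is available in the repository above).

    # Minimal sketch (our assumption, not the authors' code) of a global-local
    # dual-path cross-attention block between a target frame and a support frame.
    import torch
    import torch.nn as nn

    def local_band_mask(n: int, window: int) -> torch.Tensor:
        # Boolean attention mask: True entries are blocked. Target token i may
        # attend only to support tokens j with |i - j| <= window (a 1-D band,
        # used here as a simplification of a 2-D spatial window).
        idx = torch.arange(n)
        return (idx[None, :] - idx[:, None]).abs() > window

    class DualPathCrossAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8, window: int = 7):
            super().__init__()
            self.window = window
            # Global path: every target token attends to all support tokens.
            self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Local path: attention is masked to a neighborhood of each token.
            self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.fuse = nn.Linear(2 * dim, dim)
            self.norm = nn.LayerNorm(dim)

        def forward(self, target: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
            # target, support: (batch, tokens, dim) token sequences of two frames.
            g, _ = self.global_attn(target, support, support)
            mask = local_band_mask(target.size(1), self.window).to(target.device)
            l, _ = self.local_attn(target, support, support, attn_mask=mask)
            # Fuse both paths and keep a residual connection to the target tokens.
            return self.norm(target + self.fuse(torch.cat([g, l], dim=-1)))

    # Usage: propagate context from the previous frame into the current frame.
    block = DualPathCrossAttention(dim=256)
    cur, prev = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
    out = block(cur, prev)  # -> (2, 196, 256)

The banded mask is a deliberate simplification; a faithful version would use a 2-D spatial window over the token grid. The point illustrated is the dual-path structure: attending to the support frame both globally and within a local neighborhood, then fusing the two views.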
