Shared-Specific Feature Learning With Bottleneck Fusion Transformer for Multi-Modal Whole Slide Image Analysis.

Zhihua Wang, Xin Ding, Lequan Yu, Liansheng Wang, Xuehong Liao

doi:10.1109/tmi.2023.3287256

Abstract

The fusion of multi-modal medical data is essential for helping medical experts make treatment decisions in precision medicine. For example, combining whole slide histopathological images (WSIs) with tabular clinical data can more accurately predict the lymph node metastasis (LNM) of papillary thyroid carcinoma before surgery, avoiding unnecessary lymph node resection. However, a huge WSI carries far higher-dimensional information than the low-dimensional tabular clinical data, which makes information alignment challenging in multi-modal WSI analysis tasks. This paper presents a novel transformer-guided multi-modal multi-instance learning framework to predict lymph node metastasis from both WSIs and tabular clinical data. We first propose an effective multi-instance grouping scheme, named siamese attention-based feature grouping (SAG), to group high-dimensional WSIs into representative low-dimensional feature embeddings for fusion. We then design a novel bottleneck shared-specific feature transfer module (BSFT) to explore the shared and specific features between different modalities, where a few learnable bottleneck tokens are utilized for knowledge transfer between modalities. Moreover, a modal adaptation and orthogonal projection scheme is incorporated to further encourage BSFT to learn shared and specific features from multi-modal data. Finally, the shared and specific features are dynamically aggregated via an attention mechanism for slide-level prediction. Experimental results on our collected lymph node metastasis dataset demonstrate the effectiveness of our proposed components, and our framework achieves the best performance with an area under the curve (AUC) of 97.34%, outperforming state-of-the-art methods by over 1.27%.
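
To illustrate the general idea of exchanging information between modalities through a few bottleneck tokens, the following is a minimal NumPy sketch. It is not the BSFT module itself: the function names (`self_attention`, `bottleneck_exchange`), the single-head attention with identity projections, the token shapes, and the averaging of the bottleneck updates are all assumptions made for illustration. In a trained model the bottleneck tokens and the attention projections would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with identity projections
    # (an illustrative simplification: no learned weights).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def bottleneck_exchange(wsi_tokens, clinical_tokens, bottleneck):
    # Hypothetical helper: each modality attends only over its own
    # tokens plus the shared bottleneck tokens, so cross-modal
    # information can flow only through the bottleneck.
    n_wsi = wsi_tokens.shape[0]
    n_cli = clinical_tokens.shape[0]
    wsi_out = self_attention(np.concatenate([wsi_tokens, bottleneck], axis=0))
    cli_out = self_attention(np.concatenate([clinical_tokens, bottleneck], axis=0))
    # Assumption: merge the two per-modality bottleneck updates by averaging.
    new_bottleneck = 0.5 * (wsi_out[n_wsi:] + cli_out[n_cli:])
    return wsi_out[:n_wsi], cli_out[:n_cli], new_bottleneck

rng = np.random.default_rng(0)
wsi = rng.normal(size=(16, 8))   # e.g. 16 grouped WSI embeddings (assumed shape)
cli = rng.normal(size=(3, 8))    # e.g. 3 embedded clinical features (assumed shape)
btl = rng.normal(size=(2, 8))    # 2 bottleneck tokens (random here, learnable in practice)

wsi2, cli2, btl2 = bottleneck_exchange(wsi, cli, btl)
print(wsi2.shape, cli2.shape, btl2.shape)
```

In this sketch a single exchange step updates the WSI tokens from the WSI tokens and the bottleneck only; the clinical data influences them only after the updated bottleneck is fed into a further step, which is what keeps the cross-modal channel narrow.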

Full Text