Transparent and specular objects, such as mirrors, glass windows, and glass walls, pose significant challenges for computer vision tasks. Glass-like objects (GLOs) lack distinctive visual appearances and well-defined external shapes, which makes GLO segmentation difficult. In this study, we propose a novel bidirectional cross-modal fusion framework with shifted-window cross-attention for GLO segmentation. The framework incorporates a Feature Exchange Module (FEM) and a Shifted-Window Cross-Attention Fusion Module (SW-CAFM) at each transformer block stage to calibrate, exchange, and fuse cross-modal features. The FEM employs coordinate and spatial attention mechanisms to filter out noise and recalibrate the features of the two modalities. The SW-CAFM fuses RGB and depth features with cross-attention, leveraging the shifted-window operation to reduce the computational complexity of cross-attention. Experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art results on several glass and mirror segmentation benchmarks: mIoU scores of 90.32%, 94.24%, 88.76%, and 87.47% on the GDD, Trans10K, MSD, and RGBD-Mirror datasets, respectively.
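To illustrate the core idea of windowed cross-attention between modalities, the following is a minimal NumPy sketch (not the authors' implementation; all function names, the window size, and the single-head, unshifted formulation are illustrative assumptions). Restricting cross-attention to non-overlapping local windows makes the attention cost quadratic in the window size rather than in the full feature-map resolution, which is the complexity reduction the abstract refers to. Here, queries come from RGB windows while keys and values come from the corresponding depth windows:

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into (num_windows, ws*ws, C) tokens."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(rgb, depth, ws=4):
    """Single-head cross-attention within each local window.

    Queries are taken from RGB windows; keys and values from the
    aligned depth windows (projection matrices omitted for brevity).
    """
    q = window_partition(rgb, ws)    # (nW, ws*ws, C)
    kv = window_partition(depth, ws) # (nW, ws*ws, C)
    C = q.shape[-1]
    attn = softmax(q @ kv.transpose(0, 2, 1) / np.sqrt(C))
    return attn @ kv                 # fused features per window

# Toy 8x8 feature maps with 16 channels from the two modalities.
rgb = np.random.rand(8, 8, 16)
depth = np.random.rand(8, 8, 16)
fused = window_cross_attention(rgb, depth)
print(fused.shape)  # (4, 16, 16): 4 windows of 16 tokens, 16 channels
```

A full shifted-window scheme would additionally cyclically shift the feature maps by half a window between successive blocks so that information can propagate across window boundaries; that bookkeeping (and the multi-head, learned-projection form) is omitted here.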