Abstract
Mirror detection is challenging because mirrors have no consistent visual appearance. Even the Segment Anything Model (SAM), despite its strong zero-shot performance, cannot accurately locate mirrors. Existing methods locate mirrors by relying on specific assumptions, such as the correspondence between objects inside and outside the mirror, or the semantic association between the mirror and surrounding objects. However, these assumptions do not hold in all scenarios. For instance, the reflected objects may have no corresponding real objects in the scene, or meaningful semantic associations may be hard to extract in complex scenes. Humans, by contrast, can easily recognize mirrors from the specular texture produced by their material. To mine mirror features in more general scenes, we propose a Cross-Space-Frequency Window Transformer (CSFwinformer) that extracts spatial and frequency features for texture analysis. Specifically, we design a Spatial-Frequency Window Alignment module (SFWA) to calculate spatial-frequency feature affinities and learn the difference between mirror and non-mirror textures. We then propose a Dilated Window Attention (DWA) to extract global features that compensate for the limitations of window alignment. In addition, we propose a Cross-Modality Context Contrast module (CMCC) to fuse cross-modality features with global features, which enables information to flow between different windows and takes full advantage of cross-modality information. Extensive experiments show that our method performs favorably against state-of-the-art methods on three mirror detection benchmarks and significantly improves SAM's performance on mirror detection. The code is available at https://github.com/wangsen99/CSFwinformer.