Abstract
In recent years, with the rapid development of the Transformer in natural language processing, many researchers have recognized its potential and applied it to computer vision, producing a proliferation of approaches represented by the Vision Transformer (ViT) and the Data-efficient image Transformer (DeiT). Building on ViT, the well-known Swin Transformer was proposed and has become one of the strongest backbones for computer vision, widely used in tasks such as image classification, object detection, and video recognition. In image segmentation, however, semantic segmentation of indoor scenes remains very challenging due to the wide variety of object categories, large differences in object scale, and the many overlapping and occluded objects. To address the inability of existing RGB-D indoor semantic segmentation methods to effectively fuse multimodal features, we propose a novel indoor semantic segmentation algorithm based on the Swin Transformer. We apply the Swin Transformer to indoor RGB-D semantic segmentation and evaluate the model through extensive experiments on the mainstream indoor semantic segmentation datasets NYU-Depth V2 and SUN RGB-D. The experimental results show that the Swin-L RGB+Depth setting achieves 52.44% mIoU on NYU-Depth V2 and 51.15% mIoU on SUN RGB-D, demonstrating excellent performance in indoor semantic segmentation. By controlling the type of input features, the experiments also demonstrate that depth features improve the performance of the indoor semantic segmentation model. Our source code is publicly available at https://github.com/YunpingZheng/ISSSW.
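The abstract does not specify the fusion architecture, so the sketch below is only a minimal illustration of the dual-branch RGB+Depth idea it describes: two independent Swin encoders, one per modality, whose final-stage features are fused and decoded into per-pixel class logits. The class name `RGBDFusionSegmenter`, the choice of torchvision's Swin-T backbone, fusion by concatenation plus a 1x1 convolution, and the replication of depth to three channels are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of dual-branch RGB + Depth fusion with Swin encoders.
# NOT the paper's implementation; architecture and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import swin_t


class RGBDFusionSegmenter(nn.Module):
    def __init__(self, num_classes: int = 40):
        super().__init__()
        # Two independent Swin-T encoders, one per modality (assumption).
        self.rgb_encoder = swin_t(weights=None).features
        self.depth_encoder = swin_t(weights=None).features
        # Swin-T's final stage outputs 768 channels per branch;
        # fuse by concatenation followed by a 1x1 convolution.
        self.fuse = nn.Conv2d(2 * 768, 768, kernel_size=1)
        self.head = nn.Conv2d(768, num_classes, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # torchvision's Swin keeps channels last; move channels to dim 1.
        f_rgb = self.rgb_encoder(rgb).permute(0, 3, 1, 2)
        f_dep = self.depth_encoder(depth).permute(0, 3, 1, 2)
        fused = self.fuse(torch.cat([f_rgb, f_dep], dim=1))
        logits = self.head(fused)
        # Upsample coarse logits back to the input resolution.
        return F.interpolate(logits, size=rgb.shape[-2:],
                             mode="bilinear", align_corners=False)


model = RGBDFusionSegmenter(num_classes=40)  # NYU-Depth V2 uses a 40-class setup
rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 3, 224, 224)  # depth replicated to 3 channels (assumption)
print(model(rgb, depth).shape)  # torch.Size([1, 40, 224, 224])
```

A single fused stage keeps the sketch short; a full segmentation model would more plausibly fuse features at every Swin stage and use a multi-scale decoder such as UPerNet.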