In recent years, with the rapid development of the Transformer in natural language processing, many researchers have recognized its potential and applied it to computer vision, leading to a proliferation of approaches represented by the Vision Transformer (ViT) and the Data-efficient image Transformer (DeiT). Building on ViT, the Swin Transformer was proposed as one of the strongest backbones for computer vision and is widely used in tasks such as image classification, object detection, and video recognition. In image segmentation, however, the semantic segmentation of indoor scenes remains challenging due to the wide variety of objects, large differences in object size, and the large number of overlapping and occluded objects. To address the inability of existing RGB-D indoor semantic segmentation methods to effectively fuse multimodal features, we propose a novel indoor semantic segmentation algorithm based on the Swin Transformer. We apply the Swin Transformer to indoor RGB-D semantic segmentation and evaluate the model through extensive experiments on the mainstream indoor semantic segmentation datasets NYU-Depth V2 and SUN RGB-D. The experimental results show that the Swin-L RGB+Depth setting achieves 52.44% mIoU on the NYU-Depth V2 dataset and 51.15% mIoU on the SUN RGB-D dataset, which represents excellent performance for indoor semantic segmentation. Controlled experiments on the input modalities further demonstrate that depth features improve the performance of the indoor semantic segmentation model. Our source code is publicly available at https://github.com/YunpingZheng/ISSSW.