This paper presents a method that combines Convolutional Neural Networks (CNNs) and Transformers to reduce feature loss and improve segmentation accuracy. Transformers excel at capturing global features, but concentrating on global context risks losing the crucial local information that CNNs are suited to extract. To counteract this, we introduce the FM module, which re-extracts local information from the Encoder’s output, reinforcing local feature expression. We also emphasize careful feature fusion in the Decoder: aggregating global and local features there helps prevent the loss of feature information. To represent data features more effectively and improve accuracy across diverse segmentation tasks, we introduce the GS Feature Combination method, which adjusts the weights of the different features during aggregation. In experiments, the method achieves an Intersection over Union (IoU) of 71.3%, surpassing existing methods. The approach is relevant to the diagnosis, treatment, and prognosis prediction of osteosarcoma, and can reduce doctors’ workload and time while maintaining diagnostic precision.
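For readers unfamiliar with the reported metric, the following is a minimal sketch of how Intersection over Union can be computed for a pair of binary segmentation masks. It is an illustration only, not the evaluation code behind the reported figure: the function name `iou`, the nested-list mask representation, and the convention of returning 1.0 for two empty masks are all assumptions made here.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks.

    Both masks are nested lists of 0/1 values with the same shape
    (an assumed representation chosen to keep the sketch dependency-free).
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy example: 2 overlapping foreground pixels, 4 foreground pixels in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)
```

A dataset-level score is typically obtained by averaging such per-image (or per-class) values, though the exact averaging scheme varies between studies.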