In weld seam image acquisition, interference from lighting and impurities often produces weak edges, making weld seam defects difficult to identify. In such scenarios, segmentation must be performed specifically at the fusion pool of the weld seam. In recent years, Transformers, which leverage self-attention mechanisms, have demonstrated superior sequence modeling capabilities compared with convolutional neural networks (CNNs): self-attention weights every position in a sequence against every other, handling long-range dependencies effectively. However, Transformers lack the local feature extraction ability of CNNs. To address this limitation, this paper introduces CvT-UNet, a novel semantic segmentation model that combines the global context modeling of the Transformer with the local spatial information captured by CNNs. The model builds a U-shaped encoder-decoder network from purpose-designed CvT blocks, uses depthwise separable convolutions to keep the parameter count in check, and redesigns the skip-connection module, enabling more precise segmentation of weld seam fusion pools with fewer parameters. Experiments on three weld seam datasets captured under different environmental conditions yield mean intersection over union (mIoU) values of 93.75%, 88.31%, and 90.86%, respectively. On an automotive weld seam dataset in particular, CvT-UNet outperforms the purely convolutional UNet3+ by 1.63% in mIoU and 1.39% in mean pixel accuracy (MPA); compared with the hybrid TransUNet model, it gains 0.33% in mIoU and 0.6% in MPA; and compared with the state-of-the-art curve segmentation model LIOT, it gains 1.94% in mIoU and 0.61% in MPA, demonstrating favorable segmentation performance.
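The core architectural idea, convolutional projections feeding self-attention inside a CvT block, can be illustrated with a minimal PyTorch sketch. This is not the paper's exact configuration: the use of depthwise separable convolutions for the query/key/value projections follows the abstract's description, but the layer sizes, number of heads, normalization placement, and MLP ratio below are illustrative assumptions.

```python
# Minimal sketch of a CvT-style block: depthwise separable convolutions
# project the feature map into Q/K/V before multi-head self-attention.
# Dimensions and hyperparameters are assumptions, not the paper's values.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.pointwise(self.depthwise(x)))


class CvTBlock(nn.Module):
    """Convolutional Q/K/V projections, then standard multi-head attention."""
    def __init__(self, channels, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.proj_q = DepthwiseSeparableConv(channels)
        self.proj_k = DepthwiseSeparableConv(channels)
        self.proj_v = DepthwiseSeparableConv(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Convolutional projections preserve local spatial structure (the CNN
        # strength) before tokens are flattened for global self-attention.
        q = self.norm1(self.proj_q(x).flatten(2).transpose(1, 2))  # (B, HW, C)
        k = self.norm1(self.proj_k(x).flatten(2).transpose(1, 2))
        v = self.norm1(self.proj_v(x).flatten(2).transpose(1, 2))
        tokens = x.flatten(2).transpose(1, 2)
        attn_out, _ = self.attn(q, k, v)
        tokens = tokens + attn_out                      # residual attention
        tokens = tokens + self.mlp(self.norm2(tokens))  # residual MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)


# Example: one block applied to a 64-channel encoder feature map.
x = torch.randn(1, 64, 32, 32)
print(CvTBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```

Because the block consumes and emits feature maps of identical shape, blocks like this can be stacked at each resolution level of a U-shaped encoder-decoder and connected through skip connections, as the abstract describes.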