Semantic segmentation is a kind of classification at the pixel level, which has a remarkable effect in dealing with complex scenes. Most popular semantic segmentation models are based on Convolutional Neural Network (CNN) and Transformer structures. Convolution operations and attention modules are good at extracting local and global features. In this paper, we propose a parallel network structure, termed DualSeg, which leverages the advantages of CNN at local processing and Transformer at global interaction. And a feature fusion module (FFM) is designed to fuse parameters between modules in different branches to retain local and global features to the maximum extent. In the vineyards, the grape clusters and grape peduncles are often heavily obscured by leaves, branches and clusters, making it difficult to distinguish them accurately. Therefore, the scene is taken as an example to verify the segmentation effect of the model. In the experiment, the DualSeg model and mainstream segmentation models are used to test the segmentation effect of different models in this scene. And the experimental results showed that the DualSeg model had the best segmentation effect of all models. Specifically, the model had an IoU value of grape peduncle segmentation of 72.1%, which was more than 3.9% higher than that of other models in the case of 80 K iterations. In addition, the mIoU value of this model is 83.7%, which was the highest value compared with other models. The demonstrated performance of DualSeg model indicated that harvesting robots could be use it.