Models built on self-attention mechanisms, such as Vision Transformers (ViTs), have shown promising performance in visual tasks like semantic segmentation. This is attributed to their capacity to capture global image features, enabling them to learn more comprehensive representations. However, transformer-based models typically demand a considerable amount of training data to achieve satisfactory performance and are less effective at extracting local image features. As a result, these models may underperform in computer vision tasks that involve small-scale datasets, such as medical image segmentation. To address these issues, this paper proposes a dual-stream encoding-based transformer dubbed the Dual-stream Transformer (DS-Former). The dual-stream module in DS-Former acquires local and global features of the image simultaneously and models the relationship between the two kinds of features via self-attention. Compared with simple concatenation or serial connection, the dual-stream module extracts more comprehensive and hierarchical feature information through the fused interaction of the two feature types. Our method is evaluated on the UK Biobank (UKBB) cardiac magnetic resonance imaging (CMR) dataset and the Beyond the Cranial Vault (BTCV) abdominal challenge dataset. The experimental results show that DS-Former outperforms other state-of-the-art approaches on both datasets, indicating its potential for medical image semantic segmentation.
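
To make the dual-stream idea concrete, below is a minimal PyTorch sketch of a block that extracts local features with convolutions, models global context with self-attention, and fuses the two streams via cross-attention. The module names, dimensions, and fusion scheme are illustrative assumptions for exposition, not the authors' exact DS-Former implementation.

```python
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    """Illustrative dual-stream block: local conv branch + global attention branch,
    fused by cross-attention (hypothetical sketch, not the published architecture)."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        # Local stream: depthwise-separable convolutions capture fine-grained detail.
        self.local_stream = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global stream: multi-head self-attention over flattened spatial tokens.
        self.norm_g = nn.LayerNorm(channels)
        self.global_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fusion: global tokens attend to local tokens (cross-attention).
        self.norm_f = nn.LayerNorm(channels)
        self.fuse_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local_stream(x)                         # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                # (B, H*W, C)
        g = self.norm_g(tokens)
        global_feat, _ = self.global_attn(g, g, g)           # global self-attention
        local_tok = local.flatten(2).transpose(1, 2)         # (B, H*W, C)
        q = self.norm_f(global_feat)
        fused, _ = self.fuse_attn(q, local_tok, local_tok)   # cross-attention fusion
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return x + fused                                     # residual connection


if __name__ == "__main__":
    block = DualStreamBlock(channels=64)
    out = block(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch, the cross-attention step stands in for the "fusion interaction" described above: rather than simply concatenating the two feature maps, one stream queries the other so the fused representation carries both local detail and global context.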