Accurate segmentation of lesions in medical images is a key step in assisting clinicians with diagnosis and analysis. Most studies combine a Transformer with a CNN at a single scale, or feed only the highest-level CNN feature tensor into the Transformer, without fully exploiting the Transformer's potential. In addition, to address problems such as blurred structural boundaries and heterogeneous textures in medical images, most existing methods exploit contour information, but they simply fuse it and ignore the latent relationship between regions and contours. We propose DPCTN, a network built on the traditional encoder–decoder structure that consists of dual CNN and Transformer backbones together with parallel attention mechanisms, to achieve accurate segmentation of lesions in medical images. Local and global multiscale features are extracted by the CNN and the Transformer, respectively. A channel cross-fusion Transformer block fuses multiscale high-level local features and reduces the impact of redundant information. A dual-backbone feature fusion module effectively couples local and global high-level features. The decoder refines and enriches boundary and region features layer by layer, enabling effective supervision of both boundaries and regions. Considering the dimension collapse that may occur in attention mechanisms, a novel three-branch transposed self-attention module is designed to reduce the information loss caused by feature pooling. To verify the effectiveness of the proposed method, subjective and objective comparisons and ablation experiments were performed on four medical segmentation tasks: polyps, skin lesions, glands, and breast tumors. Extensive experimental results show that our method outperforms current state-of-the-art methods, reduces the standard deviation, and is more robust. Source code is released at https://github.com/sd-spf/DPCTN.
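As a rough illustration of what "transposed self-attention" means here, attention computed across the channel dimension rather than the spatial one, so that no spatial pooling of the feature map is required, the sketch below shows a minimal three-branch channel-attention block in PyTorch. The module name, the Q/K/V branch layout, and all hyperparameters are illustrative assumptions for exposition, not the released DPCTN implementation; see the repository above for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransposedSelfAttention(nn.Module):
    """Minimal channel-wise ("transposed") self-attention sketch.

    Q, K, V come from three parallel 1x1-conv branches; the attention map
    is C x C (channel-to-channel) instead of spatial, so no feature
    pooling is needed. Hypothetical layout, not the authors' module.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Three parallel projection branches.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable temperature for the channel-attention logits.
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Flatten spatial dimensions: (B, C, H*W).
        q = self.to_q(x).flatten(2)
        k = self.to_k(x).flatten(2)
        v = self.to_v(x).flatten(2)
        # L2-normalize along the spatial axis so logits stay bounded.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # Channel-to-channel attention map: (B, C, C).
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)
        # Aggregate values across channels, restore spatial shape.
        out = (attn @ v).reshape(b, c, h, w)
        return x + self.proj(out)  # residual connection


if __name__ == "__main__":
    block = TransposedSelfAttention(channels=64)
    feats = torch.randn(2, 64, 32, 32)
    print(block(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Because the attention map is C x C rather than (HW) x (HW), its cost is independent of spatial resolution, which is why such a block can avoid the pooling step that the abstract identifies as a source of information loss.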