Point cloud semantic segmentation is an essential ingredient in real-world scene understanding. Most existing approaches perform poorly at scene boundaries and struggle to recognize objects at different scales. In this paper, we propose a novel framework that incorporates the Transformer into the U-Net architecture for inferring pointwise semantics. Specifically, a Transformer-based cross-feature fusion module first exploits geometric and semantic information to learn feature offsets that mitigate the boundary ambiguity of segmentation results, and then applies the Transformer to enhance and fuse the encoder features across branches. Additionally, to strengthen the network's structure-to-detail perception, we design an adaptive perception module that employs cross-attention to adaptively allocate weights to encoder features at different resolutions, establishing long-range contextual dependencies. Ablation studies validate the individual contributions of our design choices. Compared with existing competitive methods, our approach achieves state-of-the-art performance on standard benchmarks. Code is available at https://github.com/xiluo-cug/TCFAP-Net.
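To make the adaptive perception idea concrete, the following is a minimal sketch of cross-attention over multi-resolution encoder features. This is a hypothetical re-implementation, not the authors' released code: the module name `AdaptivePerceptionSketch`, the feature dimension `dim`, and the assumption that encoder features have already been interpolated to a common point count are all illustrative choices; only the mechanism (per-point attention weights distributed across encoder scales) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptivePerceptionSketch(nn.Module):
    """Illustrative cross-attention that adaptively weights encoder
    features from several resolutions (a sketch, not the paper's code).

    The decoder feature serves as the query; encoder features from S
    scales, upsampled to the same point count, serve as keys/values.
    """

    def __init__(self, dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, decoder_feat, encoder_feats):
        # decoder_feat: (B, N, C); encoder_feats: list of S tensors,
        # each (B, N, C), already interpolated to N points.
        q = self.q_proj(decoder_feat)                    # (B, N, C)
        stacked = torch.stack(encoder_feats, dim=2)      # (B, N, S, C)
        k = self.k_proj(stacked)                         # (B, N, S, C)
        v = self.v_proj(stacked)
        # One attention weight per scale for every point:
        attn = torch.einsum('bnc,bnsc->bns', q, k) * self.scale
        attn = F.softmax(attn, dim=-1)                   # (B, N, S)
        # Adaptive per-point mixture of the multi-scale features:
        fused = torch.einsum('bns,bnsc->bnc', attn, v)
        return fused


# Usage with dummy inputs (shapes are assumptions for illustration):
B, N, C, S = 2, 1024, 64, 4
dec = torch.randn(B, N, C)
encs = [torch.randn(B, N, C) for _ in range(S)]
out = AdaptivePerceptionSketch(dim=C)(dec, encs)         # (B, N, C)
```

Because the softmax is taken over the scale axis rather than over points, each point learns its own mixture of coarse and fine encoder features, which is one plausible way to realize the structure-to-detail perception described above.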