Nasopharyngeal carcinoma (NPC) is a malignant tumor primarily treated by radiotherapy, and accurate delineation of the target tumor is essential to the effectiveness of that treatment. However, manual delineation is labor-intensive, and the segmentation performance of current models remains unsatisfactory because of poorly defined tumor boundaries and large variation in tumor volume. In this paper, we introduce MMCA-Net, a novel network for NPC segmentation from PET/CT images. It combines a multimodal cross-attention transformer (MCA-Transformer) with a modified U-Net architecture, and it enhances modal fusion by applying cross-attention between the CT and PET data. We compared our method with ten algorithms using fivefold cross-validation on samples from Sun Yat-sen University Cancer Center and on the public HECKTOR dataset. It achieved the best results on all four evaluation metrics, with average Dice similarity coefficients of 0.815 and 0.7944, respectively. Ablation experiments further demonstrate the superiority of our method over multiple baseline and variant techniques. The proposed method shows promise for application in other tasks.
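For readers unfamiliar with the Dice similarity coefficient cited above, it measures the overlap between a predicted mask and a reference mask as twice the size of their intersection divided by the sum of their sizes. The following is a minimal sketch, assuming binary masks flattened into sequences of 0/1 voxels. The function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are illustrative assumptions, not the evaluation code used in this work.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as tumor in both the prediction and the reference.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total number of tumor voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical toy masks: 2 shared voxels, 3 tumor voxels in each mask.
pred_mask = [1, 1, 0, 0, 1, 0]
ref_mask = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred_mask, ref_mask)  # 2 * 2 / (3 + 3)
```

A score of 1 indicates perfect overlap and 0 indicates no overlap, so the reported averages of 0.815 and 0.7944 lie on this scale.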