Referring image segmentation (RIS) aims to predict a segmentation mask for a target specified by a natural language expression. However, existing methods fail to realize the deep vision-language interaction that RIS requires, resulting in inaccurate segmentation. To address this problem, a cross-modal transformer (CMT) with language queries for referring image segmentation is proposed. First, the cross-modal encoder of CMT is designed for intra-modal and inter-modal interaction, capturing context-aware visual features. Second, to generate compact visual-aware language queries, a language-query encoder (LQ) embeds key visual cues into the linguistic features. In particular, the combination of the cross-modal encoder and the language-query encoder realizes mutual guidance between vision and language. Finally, the cross-modal decoder of CMT learns multimodal features of the referent from the context-aware visual features and the visual-aware language queries. In addition, a semantics-guided detail enhancement (SDE) module fuses the semantics-rich multimodal features with detail-rich low-level visual features, supplementing the spatial details of the predicted segmentation masks. Extensive experiments on four referring image segmentation datasets demonstrate the effectiveness of the proposed method.
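To make the described pipeline concrete, the following is a minimal PyTorch sketch of the CMT data flow (cross-modal encoder → language-query encoder → cross-modal decoder → SDE-style fusion). It is an illustration under assumptions, not the paper's implementation: all module names, layer counts, dimensions, and the simple attention-based projection from word-level multimodal features back to the visual grid are hypothetical stand-ins.

```python
# Illustrative sketch only; module names, dimensions, and fusion details are
# assumptions, since the abstract does not specify an implementation.
import torch
import torch.nn as nn

class CrossModalEncoderLayer(nn.Module):
    """Intra-modal self-attention on vision, then inter-modal attention to language."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, vis, lang):
        vis = self.n1(vis + self.self_attn(vis, vis, vis)[0])     # intra-modal interaction
        vis = self.n2(vis + self.cross_attn(vis, lang, lang)[0])  # inter-modal interaction
        return self.n3(vis + self.ffn(vis))                       # context-aware visual features

class LanguageQueryEncoder(nn.Module):
    """Embeds key visual cues into linguistic features: visual-aware language queries."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, lang, vis):
        return self.norm(lang + self.cross_attn(lang, vis, vis)[0])  # words attend to vision

class CMT(nn.Module):
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.encoder = CrossModalEncoderLayer(d, heads)
        self.lq_encoder = LanguageQueryEncoder(d, heads)
        layer = nn.TransformerDecoderLayer(d, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.sde = nn.Conv2d(2 * d, d, kernel_size=3, padding=1)  # stand-in for SDE fusion
        self.head = nn.Conv2d(d, 1, kernel_size=1)                # mask prediction

    def forward(self, vis_tokens, lang_tokens, low_level, hw):
        h, w = hw
        vis = self.encoder(vis_tokens, lang_tokens)     # context-aware visual features
        queries = self.lq_encoder(lang_tokens, vis)     # visual-aware language queries
        mm = self.decoder(queries, vis)                 # multimodal features of the referent
        # Project word-level multimodal features back onto the visual grid.
        attn = torch.softmax(vis @ mm.transpose(1, 2), dim=-1)  # (B, HW, L)
        fmap = (attn @ mm).transpose(1, 2).reshape(-1, mm.size(-1), h, w)
        fused = self.sde(torch.cat([fmap, low_level], dim=1))   # fuse with low-level details
        return self.head(fused)                                  # (B, 1, H, W) mask logits

# Toy usage with random tensors (B=2, 16x16 visual grid, 12-word expression).
model = CMT()
vis = torch.randn(2, 256, 256)     # (B, HW, d) flattened visual tokens
lang = torch.randn(2, 12, 256)     # (B, L, d) word features
low = torch.randn(2, 256, 16, 16)  # (B, d, H, W) detail-rich low-level features
mask_logits = model(vis, lang, low, (16, 16))
print(mask_logits.shape)           # torch.Size([2, 1, 16, 16])
```

The sketch mirrors the mutual-guidance idea in the abstract: the encoder conditions vision on language, the LQ encoder conditions the language queries on the resulting visual features, and the decoder reads the referent's multimodal features from both before low-level details are fused in.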