Object detection and segmentation have made great progress in robotic applications. However, intelligent agents require fine-grained recognition beyond the object level, together with language instructions, to enhance human–robot interaction. To improve a robot's interactivity when responding to language instructions, we propose a method for part-level detection and segmentation that exploits vision-language models. In this approach, a Swin Transformer is introduced as the image encoder to extract image features, and the Feature Pyramid Network (FPN) is modified to better process the hierarchical features produced by the Swin Transformer. An image decoder is then proposed to align the image features with the text embeddings, enabling human–robot interaction through language. Finally, we verify that the text embeddings are affected by the wording of the input command and that different prompt templates also influence classification. The proposed method, validated on two datasets (PartImageNet and Pascal Part), is able to understand and execute part-level tasks, and it segments and detects parts more accurately than existing interactive methods.
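
The role of prompt templates in the text branch can be illustrated with a brief sketch. The following is a minimal, hypothetical example rather than the authors' implementation: it assumes a CLIP-style `text_encoder` (a placeholder) that maps a list of prompts to fixed-dimensional embeddings, averages the embeddings produced by several templates for each (part, object) pair, and scores region-level image features against them by cosine similarity. Averaging over templates is one common way to reduce the sensitivity to prompt wording that the abstract refers to.

```python
# Illustrative sketch (not the paper's released code) of prompt-template text
# embeddings matched against image features for part-level classification.
# `text_encoder` is a placeholder for a CLIP-style text encoder; the paper
# itself uses a Swin Transformer image encoder, a modified FPN and an image
# decoder to produce the region features used below.

import torch
import torch.nn.functional as F

# Hypothetical prompt templates; the choice of template affects classification,
# so several are encoded and averaged per class.
TEMPLATES = [
    "a photo of the {} of a {}",
    "the {} of a {} in the scene",
    "a close-up of the {} of a {}",
]

def build_text_embeddings(text_encoder, parts, objects):
    """Encode every (part, object) pair under each template and average.

    `text_encoder` is assumed to map a list of strings to a tensor of shape
    (num_prompts, dim).
    """
    class_embeddings = []
    for obj in objects:
        for part in parts:
            prompts = [t.format(part, obj) for t in TEMPLATES]
            emb = text_encoder(prompts)               # (len(TEMPLATES), dim)
            emb = F.normalize(emb, dim=-1).mean(0)    # average over templates
            class_embeddings.append(F.normalize(emb, dim=-1))
    return torch.stack(class_embeddings)              # (num_classes, dim)

def classify_regions(region_features, text_embeddings, temperature=0.07):
    """Score image regions (e.g., per predicted mask or box) against the
    part-level text embeddings via cosine similarity."""
    region_features = F.normalize(region_features, dim=-1)   # (num_regions, dim)
    logits = region_features @ text_embeddings.t() / temperature
    return logits.softmax(dim=-1)                     # (num_regions, num_classes)
```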