Purpose
Artificial intelligence (AI)-aided methods have made significant progress in the auto-delineation of normal tissues. However, these approaches struggle with the auto-contouring of radiotherapy target volumes. Our goal is to model target volume delineation as a clinical decision-making problem and to solve it with large language model-aided multimodal learning.

Methods and Materials
A vision-language model, termed Medformer, was developed. It employs a hierarchical vision transformer as its backbone and incorporates large language models to extract text-rich features. The contextually embedded linguistic features are integrated into the visual features for language-aware visual encoding through a visual-language attention module. Performance was quantitatively evaluated with the Dice similarity coefficient (DSC), intersection over union (IOU), and 95th percentile Hausdorff distance (HD95). The evaluation was conducted on an in-house prostate cancer dataset and a public oropharyngeal carcinoma (OPC) dataset, totaling 668 subjects.

Results
Medformer achieved a DSC of 0.81 ± 0.10 versus 0.72 ± 0.10, an IOU of 0.73 ± 0.12 versus 0.65 ± 0.09, and an HD95 of 9.86 ± 9.77 mm versus 19.13 ± 12.96 mm for delineation of the gross tumor volume (GTV) on the prostate cancer dataset. Similarly, on the OPC dataset, it achieved a DSC of 0.77 ± 0.11 versus 0.72 ± 0.09, an IOU of 0.70 ± 0.09 versus 0.65 ± 0.07, and an HD95 of 7.52 ± 4.8 mm versus 13.63 ± 7.13 mm, representing significant improvements (p < 0.05). For delineation of the clinical target volume (CTV), Medformer achieved a DSC of 0.91 ± 0.04, an IOU of 0.85 ± 0.05, and an HD95 of 2.98 ± 1.60 mm, comparable to other state-of-the-art algorithms.

Conclusions
Auto-delineation of treatment targets based on multimodal learning outperforms conventional approaches that rely purely on visual features. Our method could be adopted into routine practice to rapidly contour the CTV and GTV.
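To illustrate how linguistic features can be fused into visual features for language-aware encoding, the sketch below shows one plausible cross-attention formulation of a visual-language attention block, in which flattened visual tokens attend to text embeddings produced by a language model. The class name, feature dimensions, and residual design are illustrative assumptions and do not reproduce the authors' Medformer implementation.

```python
# Minimal sketch of a visual-language attention block (hypothetical, not Medformer itself).
import torch
import torch.nn as nn

class VisualLanguageAttention(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=768, num_heads=8):
        super().__init__()
        # Project language-model embeddings to the visual feature width.
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        # Visual tokens act as queries; text tokens provide keys and values.
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_vis, vis_dim) flattened visual feature map from the vision backbone
        # txt_tokens: (B, N_txt, txt_dim) contextual text embeddings from the language model
        txt = self.txt_proj(txt_tokens)
        fused, _ = self.cross_attn(query=vis_tokens, key=txt, value=txt)
        # Residual connection yields language-aware visual features of the same shape.
        return self.norm(vis_tokens + fused)
```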
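For reference, the evaluation metrics can be written as follows, using one common formulation of HD95 (the exact percentile convention may differ from the one used in the study):

\[
\mathrm{DSC}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad
\mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|},
\]
\[
\mathrm{HD95}(A,B) = \max\Big( P_{95}\{\, d(a,\partial B) : a \in \partial A \,\},\; P_{95}\{\, d(b,\partial A) : b \in \partial B \,\} \Big),
\]

where \(A\) and \(B\) are the predicted and reference segmentations, \(\partial\) denotes the segmentation surface, \(d\) is the Euclidean point-to-surface distance, and \(P_{95}\) is the 95th percentile.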