Background: Patients with malignant tumors often develop bone metastases. SPECT bone scintigraphy is an effective tool for detecting bone metastases owing to its high sensitivity and the low cost of both the equipment and the radiopharmaceutical. However, the low spatial resolution of SPECT scans significantly hinders manual analysis by nuclear medicine physicians. Deep learning, a promising technique for automated image analysis, can extract hierarchical patterns from images without human intervention.

Objective: To enhance the performance of deep learning-based segmentation models, we integrate textual data from diagnostic reports with SPECT bone scans, aiming to develop an automated analysis method that outperforms segmentation models driven by unimodal data alone.

Methods: We propose a dual-path segmentation framework that extracts features from bone scan images and diagnostic reports separately. In the first path, an encoder-decoder network learns hierarchical feature representations of SPECT bone scan images. In the second path, the Chinese version of the MacBERT model serves as a text encoder that extracts features from the diagnostic reports. The extracted textual features are then fused with the image features during the decoding stage of the first path, enhancing overall segmentation performance.

Results: Experimental evaluation on real-world clinical data demonstrated the superior performance of the proposed segmentation model, which achieved a 0.0209 increase in the Dice Similarity Coefficient (DSC) over the well-known U-Net model.

Conclusions: The proposed multimodal data-driven method effectively identifies and isolates metastatic lesions in SPECT bone scans, outperforming existing classical deep learning models. This study demonstrates the value of incorporating textual data into deep learning-based segmentation of low-resolution SPECT bone scans.
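To make the dual-path design concrete, below is a minimal PyTorch sketch of one plausible realization. The class name DualPathSegNet, the channel widths, and the broadcast-and-concatenate fusion are illustrative assumptions, since the abstract does not specify the fusion mechanism or network depth; in practice, text_emb would be a pooled report embedding produced by the Chinese MacBERT encoder (e.g., the hfl/chinese-macbert-base checkpoint) rather than a random tensor.

```python
import torch
import torch.nn as nn

class DualPathSegNet(nn.Module):
    """Illustrative dual-path model: a small encoder-decoder for the SPECT
    image, with a projected report embedding fused at the decoding stage."""

    def __init__(self, text_dim: int = 768, base: int = 32):
        super().__init__()
        # Image path: a two-level encoder-decoder (stand-in for the paper's CNN).
        self.enc1 = nn.Sequential(nn.Conv2d(1, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # Text path: project the report embedding (assumed MacBERT pooled
        # output) to the decoder's channel width so it can be broadcast-fused.
        self.text_proj = nn.Linear(text_dim, base)
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base, 1, 1)  # per-pixel lesion logit

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(image)                 # (B, base, H, W)
        e2 = self.enc2(self.down(e1))         # (B, 2*base, H/2, W/2)
        d = self.up(e2)                       # (B, base, H, W)
        # Broadcast the projected text feature over all spatial positions and
        # concatenate it with the decoder feature map (one plausible fusion).
        t = self.text_proj(text_emb)          # (B, base)
        t = t[:, :, None, None].expand_as(d)  # (B, base, H, W)
        d = self.dec(torch.cat([d, t], dim=1))
        return self.head(d)

# Toy usage: a batch of 256x256 scans and 768-dim pooled report embeddings.
net = DualPathSegNet()
logits = net(torch.randn(2, 1, 256, 256), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 1, 256, 256])
```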
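For reference, the DSC reported in the Results is the standard overlap measure between a predicted segmentation mask P and the ground-truth mask G:

```latex
\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}
```

A value of 1 indicates perfect overlap, so the reported 0.0209 gain over U-Net is an absolute improvement on this 0-to-1 scale.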