Grounded Situation Recognition (GSR) aims to generate structured descriptions of images. Given an image, GSR must identify the salient verb, the nouns filling the verb's semantic roles, and the bounding-box groundings of those nouns. However, current GSR research relies on large quantities of meticulously labeled images, whose annotation is labor-intensive and time-consuming, making it costly to expand detection categories. Our study improves detection and localization accuracy under data scarcity, reducing dependency on large datasets and paving the way for broader detection capabilities. In this paper, we propose the Grounded Situation Recognition under Data Scarcity (GSRDS) model, which takes the CoFormer model as its baseline and optimizes three subtasks to better suit data-scarce scenarios: image feature extraction, verb classification, and bounding-box localization. Specifically, we replace ResNet50 with EfficientNetV2-M for stronger image feature extraction. Additionally, we introduce the Transformer Combined with CLIP for Verb Classification (TCCV) module, which exploits features extracted by CLIP's image encoder to improve verb classification accuracy. Furthermore, we design the Multi-source Verb-Role Queries (Multi-VR Queries) and Dual Parallel Decoders (DPD) modules to improve the accuracy of bounding-box localization. Through extensive comparative experiments and ablation studies, we demonstrate that our method achieves higher accuracy than mainstream approaches in data-scarce scenarios. Our code will be available at https://github.com/Zhou-maker-oss/GSRDS.
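As a rough illustration of the TCCV idea, the following PyTorch sketch refines precomputed CLIP image-encoder features with a small Transformer encoder before producing verb logits. All module names, dimensions, and the mean-pooling step are illustrative assumptions rather than the paper's actual implementation; only the 504-verb output size follows the SWiG vocabulary used by CoFormer.

```python
import torch
import torch.nn as nn

class TCCVSketch(nn.Module):
    """Hypothetical sketch of a TCCV-style verb classifier: CLIP image
    features are projected to the model width, refined by a small
    Transformer encoder, pooled, and mapped to verb logits."""
    def __init__(self, clip_dim=512, d_model=256, num_verbs=504,
                 nhead=8, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(clip_dim, d_model)  # map CLIP features to model width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(d_model, num_verbs)  # verb logits

    def forward(self, clip_feats):
        # clip_feats: (batch, num_tokens, clip_dim), e.g. token features
        # from CLIP's image encoder (assumed to be computed upstream).
        x = self.proj(clip_feats)
        x = self.encoder(x)
        x = x.mean(dim=1)          # pool token features into one vector
        return self.cls_head(x)    # (batch, num_verbs)

# Usage with dummy tensors standing in for CLIP encoder output:
feats = torch.randn(2, 50, 512)
logits = TCCVSketch()(feats)
print(logits.shape)  # torch.Size([2, 504])
```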