Abstract

Visual relationship detection is crucial for understanding visual scenes and is widely used in applications such as visual navigation, visual question answering, and machine fault detection. Traditional detection methods typically fuse multiple region modules, and training every module on extensive samples takes considerable time and resources. Because each module is independent, the overall computation is difficult to unify and lacks higher-level logical reasoning. To address these problems, we propose a novel affix-tuning transformer method for visual relationship detection, which keeps the transformer model parameters frozen and optimizes only a small continuous task-specific vector. This not only unifies the model and reduces training cost but also preserves commonsense reasoning ability without multiscale training. In addition, we design a vision-and-language sentence-expression prompt template and train only a small number of transformer parameters for downstream tasks. Our method, Prompt Template and Affix-Tuning Transformers (PTAT), is evaluated on the Visual Relationship Detection and Visual Genome datasets. Its results are close to, and on some evaluation metrics exceed, those of state-of-the-art methods.
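
For intuition, the frozen-backbone idea described above can be read as a prefix-style tuning setup: the transformer's weights stay fixed while a small continuous "affix" vector, prepended to the input sequence, and a lightweight task head are the only trained parameters. The PyTorch sketch below illustrates this under stated assumptions; the class and parameter names (AffixTuner, affix_len, num_relations) are illustrative and are not the paper's actual implementation.

```python
# Minimal sketch of affix-tuning: freeze the transformer backbone and train only a
# small continuous task-specific vector (the "affix") plus a light prediction head.
import torch
import torch.nn as nn

class AffixTuner(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4, affix_len=10, num_relations=50):
        super().__init__()
        # Stand-in for a pretrained transformer backbone; its parameters are frozen.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep transformer model parameters frozen

        # Small continuous task-specific vector prepended to every input sequence.
        self.affix = nn.Parameter(torch.randn(affix_len, d_model) * 0.02)
        # Lightweight head that maps the affix representation to relationship labels.
        self.head = nn.Linear(d_model, num_relations)

    def forward(self, tokens):                       # tokens: (batch, seq_len, d_model)
        batch = tokens.size(0)
        prefix = self.affix.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([prefix, tokens], dim=1)       # affix + visual/text features
        h = self.backbone(x)                         # gradients flow only to the affix
        return self.head(h[:, 0])                    # predict from the first affix slot

model = AffixTuner()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the affix vector and the head are optimized
```

In this reading, the training cost reduction comes from the optimizer touching only the affix and head parameters, while the frozen backbone supplies the pretrained reasoning capacity.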
