Abstract

Visual relationship detection is crucial for understanding visual scenes and is widely used in many areas, including visual navigation, visual question answering, and machine trouble detection. Traditional detection methods often fuse multiple region modules, which requires considerable time and resources to train each module with extensive samples. Because each module is independent, the computation pipeline is difficult to unify and lacks higher-level logical reasoning. To address these problems, we propose a novel affix-tuning method for transformers in visual relationship detection, which keeps the transformer model parameters frozen and optimizes a small continuous task-specific vector. This not only unifies the model and reduces training cost but also preserves commonsense reasoning without multiscale training. In addition, we design a vision-and-language sentence-expression prompt template and train only a small number of transformer parameters for downstream tasks. Our method, Prompt Template and Affix-Tuning Transformers (PTAT), is evaluated on the Visual Relationship Detection and Visual Genome datasets. The results of the proposed method are close to or even better than those of state-of-the-art methods on several evaluation metrics.
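
The core idea described above, a frozen transformer backbone combined with a small trainable continuous vector (the affix) and a lightweight prediction head, can be illustrated with a minimal PyTorch-style sketch. All dimensions, module choices, and names below (d_model, affix_len, the linear head, etc.) are illustrative assumptions for exposition, not the authors' actual PTAT implementation.

```python
# Minimal sketch of affix-tuning: the backbone transformer stays frozen and only the
# small continuous task-specific vector (the affix) plus a light head are optimized.
import torch
import torch.nn as nn

d_model, n_heads, n_layers, affix_len, n_predicates = 256, 8, 4, 10, 50

# Stand-in for a pretrained transformer backbone; its parameters are kept frozen.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
)
for p in backbone.parameters():
    p.requires_grad = False  # transformer parameters remain frozen

# Small continuous task-specific vector, trained instead of the backbone.
affix = nn.Parameter(torch.randn(1, affix_len, d_model) * 0.02)

# Lightweight head mapping affix-conditioned features to predicate scores.
head = nn.Linear(d_model, n_predicates)

def predict_relation(pair_features):
    """pair_features: (batch, seq, d_model) features for a subject-object pair."""
    batch = pair_features.size(0)
    x = torch.cat([affix.expand(batch, -1, -1), pair_features], dim=1)
    x = backbone(x)                        # frozen forward pass
    return head(x[:, affix_len:].mean(1))  # predicate logits

# Only the affix and the head receive gradient updates.
optimizer = torch.optim.AdamW([affix, *head.parameters()], lr=1e-4)
```

Under this formulation, the training cost scales with the size of the affix and head rather than the full backbone, which is what allows a single frozen model to be reused across downstream relationship-detection tasks.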
