Exploiting the Social-Like Prior in Transformer for Visual Reasoning

Yudong Han, Mingzhu Xu, Yupeng Hu, Haoyu Tang, Liqiang Nie, Xuemeng Song

doi:10.1609/aaai.v38i3.27977

Abstract

Thanks to the global dependency modeling of self-attention (SA), transformer-based approaches have become a leading choice for many downstream visual reasoning tasks, such as visual question answering (VQA) and referring expression comprehension (REC). However, recent studies suggest that SA tends to suffer from rank collapse, which leads to representation degradation as the transformer layers go deeper. Inspired by social network theory, we draw an analogy between social behavior and regional information interaction in SA, and use two key notions from social networks, structural holes and degree centrality, to explore how SA learning can be optimized. This yields two plug-and-play social-like modules. The first module, based on structural holes, makes information interaction in SA more structured, which avoids redundant information aggregation and global feature homogenization and thereby remedies rank collapse. The second module characterizes and refines the discriminability of the representations by considering the degree centrality of regions and the transitivity of relations. Without bells and whistles, our model with the social-like prior outperforms a range of baselines by a noticeable margin on five benchmarks across the VQA and REC tasks, and a series of explanatory results reveal the social-like behaviors in SA.
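To make the two notions in the abstract concrete, the following is a minimal sketch that treats an attention matrix as a weighted directed graph over image regions. It is not the authors' modules: the function names, the toy scores, and the choice of weighted in-degree as the centrality measure are all assumptions made for illustration. The homogenization measure is likewise only a rough proxy: when every row of the attention matrix is identical, every region receives the same aggregated feature, which is the rank-one situation that rank collapse refers to.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores):
    """Row-wise softmax: row i is how region i spreads its attention."""
    return [softmax(row) for row in scores]


def in_degree_centrality(attn):
    """Weighted in-degree (assumed measure): attention region j receives
    from all other regions, ignoring self-attention on the diagonal."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n) if i != j) for j in range(n)]


def homogenization_gap(attn):
    """Frobenius distance between attn and the matrix whose rows all equal
    the mean row. A gap of zero means all rows are identical, so every
    region aggregates the same feature (a rank-one output)."""
    n = len(attn)
    mean_row = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    return math.sqrt(
        sum((attn[i][j] - mean_row[j]) ** 2 for i in range(n) for j in range(n))
    )


# Hypothetical raw scores for three regions; region 0 is attended to strongly.
scores = [
    [0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0],
    [3.0, 1.0, 0.0],
]
attn = attention_weights(scores)
centrality = in_degree_centrality(attn)
gap = homogenization_gap(attn)

# Uniform scores give identical rows, i.e. fully homogenized attention.
uniform = attention_weights([[0.0] * 3 for _ in range(3)])
uniform_gap = homogenization_gap(uniform)
```

In this toy setting, a region with high in-degree plays the role of a well-connected member of a social network, and a shrinking gap across layers would indicate the feature homogenization the abstract describes.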


More From: Proceedings of the AAAI Conference on Artificial Intelligence

Similar Papers

Improving Automatic VQA Evaluation Using Large Language Models
Oscar Mañas ... Aishwarya Agrawal
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

Neural Networks for Detecting Irrelevant Questions During Visual Question Answering
Mengdi Li ... Cornelius Weber
01 Jan 2020

Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention
Wei Suo ... Peng Wang
01 Aug 2021

Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo ... Peng Yao
01 Jun 2020
