Abstract
Effectively leveraging objects and optical character recognition (OCR) tokens to reason about pivotal scene text is critical for the challenging Text-based Visual Question Answering (TextVQA) task. Graph-based models can effectively capture the semantic relationships among visual entities (objects and tokens) and achieve remarkable performance in TextVQA. However, previous efforts usually leverage all visual entities and ignore the negative effect of superfluous ones. This article presents a Graph Pooling Inference Network (GPIN), an evolutionary graph learning method that purifies the visual entities and captures the core semantics. We observe that dense clusters of reduplicative objects and crowds of semantically dependent OCR tokens usually co-exist in the image. Motivated by this, GPIN adopts an adaptive node dropping strategy that dynamically downscales semantically close nodes for graph evolution and update. To deepen the comprehension of scene text, GPIN uses a dual-path hierarchical graph architecture that progressively aggregates the semantics of the evolved object graph and the evolved token graph into a graph vector, which serves as a visual cue to facilitate answer reasoning. This design effectively eliminates object redundancy and strengthens the association of semantically continuous tokens. Experiments conducted on the TextVQA and ST-VQA datasets show that GPIN achieves promising performance compared with state-of-the-art methods.
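To make the abstract's two central ideas concrete, the sketch below shows, under stated assumptions, what a score-based adaptive node-dropping step and a dual-path readout could look like in PyTorch. It is not the authors' implementation: the module names (NodeDropPool, DualPathReadout), the keep_ratio parameter, and the mean-pooling fusion are illustrative choices standing in for whatever scoring, pruning, and aggregation GPIN actually uses.

```python
# Minimal sketch (assumed, not the paper's code): score-based node dropping in the
# spirit of "adaptive node dropping", plus a dual-path readout that fuses an
# object-graph vector and a token-graph vector into one graph vector.
import torch
import torch.nn as nn


class NodeDropPool(nn.Module):
    """Score nodes, keep the top-k, and return pruned node features and adjacency."""

    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # learned node-importance scorer
        self.keep_ratio = keep_ratio      # assumed hyperparameter

    def forward(self, x, adj):
        # x: (N, dim) node features, adj: (N, N) adjacency
        scores = torch.sigmoid(self.scorer(x)).squeeze(-1)   # (N,)
        k = max(1, int(self.keep_ratio * x.size(0)))
        idx = torch.topk(scores, k).indices                   # indices of kept nodes
        x = x[idx] * scores[idx].unsqueeze(-1)                # gate the kept features
        adj = adj[idx][:, idx]                                # prune adjacency to kept nodes
        return x, adj


class DualPathReadout(nn.Module):
    """Pool the object graph and the token graph separately, then fuse the two
    pooled vectors into a single graph vector used as a visual cue."""

    def __init__(self, dim):
        super().__init__()
        self.obj_pool = NodeDropPool(dim)
        self.tok_pool = NodeDropPool(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, obj_x, obj_adj, tok_x, tok_adj):
        obj_x, _ = self.obj_pool(obj_x, obj_adj)
        tok_x, _ = self.tok_pool(tok_x, tok_adj)
        graph_vec = torch.cat([obj_x.mean(0), tok_x.mean(0)], dim=-1)
        return self.fuse(graph_vec)                           # (dim,)


if __name__ == "__main__":
    dim, n_obj, n_tok = 64, 12, 8
    model = DualPathReadout(dim)
    vec = model(torch.randn(n_obj, dim), torch.rand(n_obj, n_obj),
                torch.randn(n_tok, dim), torch.rand(n_tok, n_tok))
    print(vec.shape)  # torch.Size([64])
```

In this toy form, the top-k gate is what removes semantically redundant nodes, and the two-branch readout mirrors the abstract's dual-path aggregation of object and token graphs; the full model would additionally involve graph message passing and answer decoding, which are omitted here.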