Recent visual question answering (VQA) frameworks employ different attention modules to derive a correct answer. The concept of attention is deeply rooted in human cognition, which has led to its remarkable success in deep neural networks. In this study, we aim to design a VQA framework that draws on human biological and psychological concepts to achieve a sound understanding of the vision and language modalities. To this end, we introduce a hierarchical reasoning framework based on the perception action cycle (HIPA) to tackle VQA tasks. The perception action cycle (PAC) explains how humans learn about and interact with their surrounding world. The proposed framework integrates the reasoning process over multiple modalities with the concepts introduced in PAC across multiple phases. It comprehends the visual modality through three phases of reasoning: object-level attention, organization, and interpretation. It comprehends the language modality through word-level attention, interpretation, and conditioning. Subsequently, the vision and language modalities are interpreted dependently, in a cyclic and hierarchical fashion, throughout the entire framework. To further assess the generated visual and language features, we argue that image–question pairs with the same answer ought to eventually have similar visual and language features. Accordingly, we conduct visual and language feature evaluation experiments using the standard deviation of cosine similarity and of Manhattan distance as metrics. We show that employing PAC in our framework reduces the standard deviation relative to other VQA frameworks, indicating more consistent features for same-answer pairs. For further assessment, we also evaluate HIPA on the visual relationship detection (VRD) task. The proposed method achieves state-of-the-art results on the TDIUC and VRD datasets and competitive results on the VQA 2.0 dataset. The code is available at github.com/Safaa1113/HiPA-Framework.
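As a concrete illustration of the consistency check described above, the following is a minimal sketch of how the standard deviation of cosine similarity within same-answer groups could be computed. The function name `same_answer_similarity_std`, the grouping logic, and the toy data are illustrative assumptions, not the paper's implementation; see the linked repository for the authors' code.

```python
import numpy as np

def same_answer_similarity_std(features: np.ndarray, answers: np.ndarray) -> float:
    """Standard deviation of pairwise cosine similarities, computed within
    groups of samples that share the same answer. Under the abstract's
    argument, a lower value indicates more consistent same-answer features.
    NOTE: this grouping scheme is an assumption for illustration only."""
    # L2-normalize rows so a dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = []
    for answer in np.unique(answers):
        group = normed[answers == answer]
        if len(group) < 2:
            continue  # a group needs at least one pair of samples
        # Pairwise cosine similarities; keep the strict upper triangle
        # to skip self-similarities and duplicate pairs.
        sim_matrix = group @ group.T
        iu = np.triu_indices(len(group), k=1)
        sims.append(sim_matrix[iu])
    return float(np.std(np.concatenate(sims)))

# Toy usage: six fused image-question feature vectors with their answers.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 128))
ans = np.array(["yes", "yes", "no", "no", "yes", "no"])
print(same_answer_similarity_std(feats, ans))
```

The same routine applies to the Manhattan-distance variant by replacing the cosine similarity with `np.abs(a - b).sum()` over each pair.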