Abstract

In this study, we propose a novel encoder–decoder cycle (EDC) framework, inspired by the human learning process known as the perception-action cycle, to tackle challenging problems such as visual question answering (VQA) and visual relationship detection (VRD). EDC treats the understanding of the visual features of an image as perception and the act of answering the question about that image as an action. In the perception-action cycle, information is first collected from the environment and passed to sensory structures in the brain to form an understanding of the environment. The acquired knowledge is then passed to motor structures to perform an action on the environment. Next, sensory structures perceive the altered environment and improve their understanding of the surrounding world. This process of understanding the environment, acting accordingly, and then re-evaluating the initial understanding occurs cyclically throughout human life. EDC mimics this mechanism of introspection by comprehending and refining visual features to acquire the knowledge needed to answer the question. It then decodes visual and language features into answer features and feeds them back cyclically to the encoder. In the VRD task, EDC decodes visual features to generate predicate features. We evaluate the proposed framework on the TDIUC, VQA 2.0, and VRD datasets, where it outperforms state-of-the-art models on TDIUC and VRD.
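The cyclic encode-decode-feedback loop described above can be illustrated with a minimal sketch. This is not the paper's architecture: the dimensionality, the linear layers, and the tanh fusion are all toy assumptions standing in for the learned encoder and decoder modules; only the control flow (encode, decode, feed answer features back, repeat) reflects the described mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimensionality (assumed)

# Hypothetical linear stand-ins for the learned encoder and decoder.
W_enc = rng.normal(scale=0.1, size=(3 * D, D))  # fuses visual, question, feedback
W_dec = rng.normal(scale=0.1, size=(2 * D, D))  # maps fused + question to answer features

def edc_forward(visual, question, n_cycles=3):
    """Run the encoder-decoder cycle: encode (perception), decode (action),
    then feed the answer features back to the encoder, cyclically."""
    answer = np.zeros(D)  # no feedback available on the first cycle
    for _ in range(n_cycles):
        # Perception: refine the visual understanding using the question
        # and the previous cycle's answer features.
        fused = np.tanh(np.concatenate([visual, question, answer]) @ W_enc)
        # Action: decode fused visual-language features into answer features.
        answer = np.tanh(np.concatenate([fused, question]) @ W_dec)
    return answer

visual = rng.normal(size=D)
question = rng.normal(size=D)
ans = edc_forward(visual, question)
print(ans.shape)  # (8,)
```

For the VRD task, the same loop would decode the fused features into predicate features instead of answer features; the feedback structure is unchanged.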
