Abstract
In this study, we propose a novel encoder–decoder cycle (EDC) framework, inspired by the human learning process known as the perception–action cycle, to tackle challenging problems such as visual question answering (VQA) and visual relationship detection (VRD). EDC treats understanding the visual features of an image as perception and answering a question about that image as action. In the perception–action cycle, information is first collected from the environment and passed to sensory structures in the brain to form an understanding of the environment. The acquired knowledge is then passed to motor structures to perform an action on the environment. Next, the sensory structures perceive the altered environment and improve their understanding of the surrounding world. This process of understanding the environment, performing a corresponding action, and then re-evaluating the initial understanding occurs cyclically in human life. EDC first mimics this introspective mechanism by comprehending and refining visual features to acquire the knowledge needed to answer the question. Subsequently, it decodes visual and language features into answer features, feeding them back cyclically to the encoder. In the VRD task, EDC decodes visual features to generate predicate features. We evaluate the proposed framework on the TDIUC, VQA 2.0, and VRD datasets; it outperforms state-of-the-art models on TDIUC and VRD.
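The abstract does not specify the authors' exact layer choices, so the following is only a minimal sketch of how such a cyclic encoder–decoder feedback loop might be wired in PyTorch. The module choices (multi-head attention blocks, a learned answer seed token, a fixed number of cycles, and all names and shapes) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class EncoderDecoderCycle(nn.Module):
    """Hypothetical sketch of a cyclic encoder-decoder (EDC-style) loop."""

    def __init__(self, dim: int = 512, num_cycles: int = 3):
        super().__init__()
        self.num_cycles = num_cycles  # assumed number of perception-action cycles
        # Encoder: refines visual features, conditioned on fed-back answer features.
        self.encoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Decoder: fuses refined visual and question features into answer features.
        self.decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Learned seed standing in for the answer features on the first cycle.
        self.answer_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, visual: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # visual:   (B, N_regions, dim) image region features
        # question: (B, N_tokens, dim)  encoded question features
        answer = self.answer_token.expand(visual.size(0), -1, -1)
        for _ in range(self.num_cycles):
            # Perception: re-encode visual features while attending to the
            # current answer belief (the cyclic feedback into the encoder).
            context = torch.cat([visual, answer], dim=1)
            visual, _ = self.encoder(visual, context, context)
            # Action: decode visual and language features into new answer
            # features, which are fed back to the encoder on the next cycle.
            fused = torch.cat([visual, question], dim=1)
            answer, _ = self.decoder(answer, fused, fused)
        return answer.squeeze(1)  # (B, dim) final answer representation


# Usage with dummy features (36 image regions, 14 question tokens):
model = EncoderDecoderCycle()
out = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))  # -> (2, 512)
```

In this reading, each iteration of the loop plays one perception–action cycle: the encoder's re-evaluation of the visual features is conditioned on the previous cycle's answer features, mirroring how the sensory structures re-perceive the environment after an action.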