Abstract

Visual question answering (VQA) is a multimodal task in which a model reasons jointly over a given image and a corresponding natural-language question to produce an answer. Traditional VQA models use region-based, top-down image feature representations. This approach detaches regional features from their global context, so the global semantic content of the visual features is underutilized. To address this problem, the relationships between regions, and between regions and the global representation, must be strengthened to obtain more accurate visual features that align better with the question text. This paper therefore proposes a multi-level visual feature enhancement method (MLVE). It consists mainly of a separated visual feature representation module (SVFR) and a joint visual feature representation module (JVFR). A graph attention network is the core component of both modules, enhancing the relationships between regions and between regions and the global representation. The two modules learn different levels of visual semantic relationships, providing richer visual feature representations. The effectiveness of this approach is verified on the VQA 2.0 dataset.
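The abstract describes enhancing region features with a graph attention network that also attends to a global visual node. The sketch below is not the authors' code; it is a minimal, assumption-laden illustration (names such as `RegionGlobalGAT`, `d_model`, and the single-head formulation are hypothetical) of how region features and one pooled global feature could be placed on a fully connected graph and refined by attention, in the spirit of the SVFR/JVFR modules.

```python
# Minimal sketch, assuming PyTorch: region nodes plus one global node refined
# by a single-head graph-attention layer. Illustrative only, not the MLVE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGlobalGAT(nn.Module):
    """Single-head graph attention over region nodes and one global node."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)   # shared node projection
        self.attn = nn.Linear(2 * d_model, 1)     # pairwise attention score

    def forward(self, regions: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # regions: (B, N, d) from a detector; global_feat: (B, d), e.g. pooled grid features
        nodes = torch.cat([regions, global_feat.unsqueeze(1)], dim=1)  # (B, N+1, d)
        h = self.proj(nodes)
        B, M, d = h.shape
        # Attention logits for every node pair on the fully connected graph
        hi = h.unsqueeze(2).expand(B, M, M, d)
        hj = h.unsqueeze(1).expand(B, M, M, d)
        logits = self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1)    # (B, M, M)
        alpha = F.softmax(F.leaky_relu(logits), dim=-1)
        out = torch.bmm(alpha, h)                                      # aggregate neighbors
        return F.elu(out)                                              # enhanced node features

# Usage: 36 detected regions plus one global feature per image
enhanced = RegionGlobalGAT()(torch.randn(2, 36, 512), torch.randn(2, 512))
print(enhanced.shape)  # torch.Size([2, 37, 512])
```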
