Abstract

Multi-object counting in visual question answering (VQA) remains a challenging problem. Existing VQA models mainly adopt an object detection network to extract image features and combine it with a soft attention mechanism to improve accuracy. However, the same object may be counted repeatedly when the object detection network extracts image features. In addition, the attention weights that the soft attention mechanism assigns to all objects sum to 1, so the quantity information carried by the attention is constant at 1 regardless of how many objects are present. We propose a new counting attention mechanism based on classification confidence. The main idea is to compute the initial attention with a sigmoid function together with a similarity measure over the object locations produced by the object detection network; we introduce classification confidence to compute a more accurate similarity and thereby remove the constraint that the quantity information under the existing soft attention mechanism is always 1. Experiments compare the proposed counting attention mechanism with the baseline model and related work on the VQA v2 dataset. The results show that the counting attention mechanism improves counting accuracy by 6.4% over the baseline model and surpasses most VQA models.
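The core contrast the abstract draws can be illustrated with a minimal sketch. This is not the paper's implementation; the scores, boxes, and confidences below are hypothetical, and the IoU-based suppression is a much-simplified stand-in for the paper's location-similarity step. It shows why softmax weights erase count information while sigmoid weights, scaled by classification confidence and de-duplicated by box overlap, approximate the number of relevant objects:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Hypothetical relevance scores for 4 detected boxes; the first two are
# duplicate detections of the same object.
scores = np.array([4.0, 4.0, 4.0, -3.0])
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (90, 90, 95, 95)]
conf = np.array([0.9, 0.8, 0.9, 0.2])   # hypothetical classification confidences

soft_w = softmax(scores)                 # always sums to 1: no count signal
sig_w = sigmoid(scores) * conf           # independent per-object relevance

# Suppress duplicate detections: down-weight each box by its overlap
# with every earlier box (simplified de-duplication).
kept = sig_w.copy()
for j in range(len(boxes)):
    for i in range(j):
        kept[j] *= 1.0 - iou(boxes[i], boxes[j])

print(soft_w.sum())   # 1.0, however many objects match the question
print(kept.sum())     # roughly 2: the two distinct relevant objects
```

The key point is that each sigmoid weight answers "is this object relevant?" independently, so the sum of weights, after overlapping duplicates are suppressed, can serve as a count estimate, whereas softmax normalization forces the sum to 1 by construction.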
