Abstract

Multi-object counting in visual question answering (VQA) remains a challenging problem. Existing VQA models mainly adopt an object detection network to extract image features and combine it with a soft attention mechanism to improve accuracy. However, the same object may be counted repeatedly when the object detection network extracts image features. In addition, the attention weights that the soft attention mechanism assigns to all objects sum to 1, so the quantity information carried by the attention is constant at 1 regardless of how many objects are present. We propose a new counting attention mechanism based on classification confidence. The main idea is to compute the initial attention with a sigmoid function together with a similarity measure over the object locations produced by the object detection network; we introduce classification confidence to compute a more accurate similarity and thereby remove the constraint that the quantity information under the existing soft attention mechanism is always 1. Experiments compare the proposed counting attention mechanism with the baseline model and related work on the VQA v2 dataset. The results show that the counting attention mechanism improves counting accuracy by 6.4% over the baseline model and surpasses most VQA models.
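The core contrast the abstract draws can be illustrated with a minimal sketch. This is not the paper's implementation; the scores, boxes, and confidences below are hypothetical, and the IoU-based suppression is a much-simplified stand-in for the paper's location-similarity step. It shows why softmax weights erase count information while sigmoid weights, scaled by classification confidence and de-duplicated by box overlap, approximate the number of relevant objects:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Hypothetical relevance scores for 4 detected boxes; the first two are
# duplicate detections of the same object.
scores = np.array([4.0, 4.0, 4.0, -3.0])
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (90, 90, 95, 95)]
conf = np.array([0.9, 0.8, 0.9, 0.2])   # hypothetical classification confidences

soft_w = softmax(scores)                 # always sums to 1: no count signal
sig_w = sigmoid(scores) * conf           # independent per-object relevance

# Suppress duplicate detections: down-weight each box by its overlap
# with every earlier box (simplified de-duplication).
kept = sig_w.copy()
for j in range(len(boxes)):
    for i in range(j):
        kept[j] *= 1.0 - iou(boxes[i], boxes[j])

print(soft_w.sum())   # 1.0, however many objects match the question
print(kept.sum())     # roughly 2: the two distinct relevant objects
```

The key point is that each sigmoid weight answers "is this object relevant?" independently, so the sum of weights, after overlapping duplicates are suppressed, can serve as a count estimate, whereas softmax normalization forces the sum to 1 by construction.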
