Abstract

The alignment of information from images and questions is of great significance in the visual question answering task. Whether an object in an image is related to the question is a basic judgement that relies on feature alignment. Many previous works have proposed different alignment methods to build better cross-modal interactions, and the attention mechanism is the most widely used among them. The classical bottom-up and top-down model builds a top-down attention distribution by concatenating the question feature to each image feature and computing attention weights between the question and the image. However, the bottom-up and top-down model does not consider positional information in the image and question. In this paper, we revisit the attention distribution from a position perspective, aligning the question with objects' positional information. We first embed the positional information of each object in the image and compute a position attention distribution that indicates the relevance of each object's position in the context of the current question. Through this attention distribution, the model can select the positions in the image relevant to answering the given question. The position attention distribution is then combined with the feature attention distribution to obtain the final distribution. We evaluate our method on the visual question answering (VQA 2.0) dataset and show that it is effective for multimodal alignment.
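As a rough illustration of the idea the abstract describes, the following NumPy sketch computes a top-down feature attention over object features and a separate position attention over embedded bounding-box coordinates, then combines them into one distribution. All dimensions, the linear projections, and the way the two attention scores are combined (a simple sum of logits) are assumptions for illustration, not the paper's actual architecture; in the real model these projections would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# hypothetical sizes: K objects, visual / question / position-embedding dims
K, D_V, D_Q, D_P = 36, 2048, 512, 64

v = rng.standard_normal((K, D_V))     # object features (e.g. from a detector)
q = rng.standard_normal(D_Q)          # pooled question feature
boxes = rng.random((K, 4))            # normalized (x1, y1, x2, y2) per object

# stand-ins for learned projections (random here)
W_pos = rng.standard_normal((4, D_P)) * 0.1
w_feat = rng.standard_normal((D_V + D_Q,)) * 0.01
w_pos_attn = rng.standard_normal((D_P + D_Q,)) * 0.01

# embed each object's positional information
p = boxes @ W_pos                     # (K, D_P)

q_tiled = np.tile(q, (K, 1))          # question repeated per object

# top-down feature attention: score each [v_k; q] concatenation
feat_logits = np.concatenate([v, q_tiled], axis=1) @ w_feat
# position attention: score each [p_k; q] concatenation
pos_logits = np.concatenate([p, q_tiled], axis=1) @ w_pos_attn

# combine the two attention signals into the final distribution
alpha = softmax(feat_logits + pos_logits)   # (K,), sums to 1
attended = alpha @ v                        # attended image feature, (D_V,)
```

The attended feature would then be fused with the question feature and passed to an answer classifier, as in the standard bottom-up and top-down pipeline.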
