Abstract

Visual question answering (VQA) is a cross-modal learning task that requires a simultaneous understanding of image and question text information. The attention mechanism simulates human vision by retaining the important parts of the information and discarding the unimportant parts. However, because both important and unimportant features participate in the weighted summation, conventional attention never truly retains or discards features. Furthermore, in many VQA solutions, the attention mechanism fails to improve performance by aligning the key regions of the image with the key parts of the question. To solve these problems, this paper proposes an attention mechanism that sparsifies features, achieving attention with genuine retention and discarding. In addition, a feature sparse co-attention network is constructed to align the key regions of vision and text. The network is composed of an image self-attention unit, a question self-attention unit, and a guiding attention unit, where each self-attention unit has a feature sparse function. These units are cascaded in depth to form a hierarchical structure that, as a whole, realizes co-attention. Experiments on the VQA-v2 dataset show that the proposed method outperforms recent state-of-the-art methods.
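The core idea can be illustrated with a minimal sketch of a feature-sparse self-attention unit, assuming a top-k hard-selection rule as the sparsification criterion; the paper's exact criterion, projection sizes, and any multi-head details are not given in the abstract, so everything below is an assumption. Scores outside the top k are masked before the softmax, so discarded features receive exactly zero weight and genuinely drop out of the weighted sum, unlike standard soft attention.

```python
# Minimal sketch (assumed design): feature-sparse self-attention with
# hard top-k selection; not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseSelfAttention(nn.Module):
    """Self-attention where each query keeps only its k best-matching
    features; all other features get exactly zero attention weight."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.k = k                    # number of features truly retained
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features, dim), e.g. image-region or word features
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) * self.scale        # (B, N, N)
        # k-th largest score per query row; everything below it is discarded
        kth = scores.topk(self.k, dim=-1).values[..., -1:]   # (B, N, 1)
        scores = scores.masked_fill(scores < kth, float('-inf'))
        attn = F.softmax(scores, dim=-1)  # discarded entries are exactly 0
        return attn @ v                   # only retained features contribute
```

A guiding attention unit would follow the same pattern, with queries drawn from one modality and keys/values from the other, so that the question can steer which image regions survive the sparsification.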
