Some Can Be Better than All: Multimodal Star Transformer for Visual Dialog

Abstract

Visual dialog involves answering questions by analyzing both images and dialogue history. While current multimodal research has effectively modeled the interactions among images, dialogue history, and questions, it incurs significant computational overhead and complexity. To address these challenges, this paper introduces a MultiModal Star Transformer (MMST) that effectively models the interactions between visual and textual modalities, as well as within each modality, with linear computational overhead. MMST utilizes a relay token for each modality, allowing each satellite token to interact with its two adjacent tokens, its previous state, and the two relay tokens. The introduction of relay tokens ensures that every two non-adjacent satellite tokens are two-hop neighbors, thus enabling MMST to support both intramodal long-range connections and intermodal interactions efficiently. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that MMST performs comparably to full-attention models.
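The relay/satellite connectivity described above can be sketched as a sparse attention mask. The function name, token layout, and the ring-style neighbour wrapping below are our own illustrative assumptions, not details from the paper; the sketch only demonstrates why each satellite row stays constant-size (hence linear overall cost) while any two satellites remain two-hop neighbours through a relay.

```python
import numpy as np

def star_attention_mask(n_vis: int, n_txt: int) -> np.ndarray:
    """Boolean attention mask for a two-modality star topology (a sketch).

    Assumed token layout:
    [visual satellites | visual relay | textual satellites | textual relay].
    Each satellite attends to itself, its two ring neighbours within its
    modality, and both relay tokens; each relay attends to every token.
    """
    n = n_vis + n_txt + 2
    vis_relay = n_vis               # index of the visual relay token
    txt_relay = n_vis + 1 + n_txt   # index of the textual relay token
    mask = np.zeros((n, n), dtype=bool)

    vis_sats = list(range(n_vis))
    txt_sats = list(range(n_vis + 1, n_vis + 1 + n_txt))
    for group in (vis_sats, txt_sats):
        for k, i in enumerate(group):
            mask[i, i] = True                            # its previous state
            mask[i, group[(k - 1) % len(group)]] = True  # left neighbour
            mask[i, group[(k + 1) % len(group)]] = True  # right neighbour
            mask[i, vis_relay] = True                    # both relay tokens
            mask[i, txt_relay] = True
    mask[vis_relay, :] = True  # relays aggregate over all tokens
    mask[txt_relay, :] = True
    return mask

mask = star_attention_mask(n_vis=5, n_txt=4)
# Each satellite row allows only 5 positions, independent of sequence length.
print(mask[0].sum())                                    # 5
# Two-hop property: squaring the adjacency reaches every token pair.
print((mask.astype(int) @ mask.astype(int) > 0).all())  # True
```

Because relay rows are dense, two matrix "hops" connect any satellite pair (satellite → relay → satellite), which is the property the abstract credits for supporting long-range and intermodal interactions cheaply.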

Similar Papers
  • Conference Article
  • 20 citations
  • 10.18653/v1/2021.findings-acl.38
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation
  • Jan 1, 2021
  • Feilong Chen + 4 more

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with visual scene step by step according to the order of the dialogue and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.

  • Conference Article
  • 7 citations
  • 10.21437/interspeech.2020-2359
TMT: A Transformer-Based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-Aware Dialog
  • Oct 25, 2020
  • Wubo Li + 3 more

Audio Visual Scene-aware Dialog (AVSD) is a task to generate responses when discussing a given video. The previous state-of-the-art model shows superior performance for this task using a Transformer-based architecture. However, there remain some limitations in learning better representations of modalities. Inspired by Neural Machine Translation (NMT), we propose the Transformer-based Modal Translator (TMT) to learn the representations of the source modal sequence by translating the source modal sequence to the related target modal sequence in a supervised manner. Based on Multimodal Transformer Networks (MTN), we apply TMT to video and dialog, proposing MTN-TMT for the video-grounded dialog system. On the AVSD track of the Dialog System Technology Challenge 7, MTN-TMT outperforms MTN and other submission models in both the Video and Text task and the Text Only task. Compared with MTN, MTN-TMT improves all metrics, achieving a relative improvement of up to 14.1% on CIDEr. Index Terms: multimodal learning, audio-visual scene-aware dialog, neural machine translation, multi-task learning

  • Book Chapter
  • 10.1007/978-981-99-8429-9_13
RecFormer: Recurrent Multi-modal Transformer with History-Aware Contrastive Learning for Visual Dialog
  • Dec 24, 2023
  • Liucun Lu + 5 more

