Multimodal Interaction Research Articles

Visual grounding (VG) locates target objects in visual scenes by understanding given natural language queries. Current methods for VG mainly focus on grounding referring expressions or noun phrases covered in the labeled training samples. Despite their grounding prowess, these approaches struggle in grounding novel query–image pairs excluded from the training data. This shortage is usually caused by the deficiency of discriminative representation learning both in images and queries. To address these issues, we propose a one-stage coarse-to-fine framework for zero-shot VG to ground novel query-image samples. Specifically, in the coarse stage, we mine the global context information in the visual features and query embeddings by employing a multi-head self-attention block, strengthening the intra-modality relations in the visual and textual features. In the fine stage, we first learn the query-aware visual representations based on the acquired global context information via a multi-modal relation-enhanced transformer block, which explores the informative information by modeling the cross-modal interaction among visual and textual domains. We further excavate target-oriented discriminative representations from the acquired query-aware visual representations by a noun phrase-guided multi-modal interaction network, which augments the interaction between target-related phrases and the obtained query-aware visual representations to enhance the distinction of target regions, enhancing the subsequent referred target grounding and generalizing. In order to validate the proposed approach, we implement extensive experiments and ablation studies on public benchmark datasets, including RefCOCO, RefCOCO+, RefCOCOg, Flickr30K Entity, Flickr-Split-0 and Flickr-Split-1. Experimental results demonstrate that our approach substantially improves the grounding accuracy and achieves new state-of-the-art performance under single-stage training and testing.

Read full abstract

Today, Smart Assistants (SAs) are supported by significantly improved Natural Language Processing (NLP) and Natural Language Understanding (NLU) engines as well as AI-enabled decision support, enabling efficient information communication, easy appliance/device control, and seamless access to entertainment services, among others. In fact, an increasing number of modern households are being equipped with SAs, which promise to enhance user experience in the context of smart environments through verbal interaction. Currently, the market in SAs is dominated by products manufactured by technology giants that provide well designed off-the-shelf solutions. However, their simple setup and ease of use come with trade-offs, as these SAs abide by proprietary and/or closed-source architectures and offer limited functionality. Their enforced vendor lock-in does not provide (power) users with the ability to build custom conversational applications through their SAs. On the other hand, employing an open-source approach for building and deploying an SA (which comes with a significant overhead) necessitates expertise in multiple domains and fluency in the multimodal technologies used to build the envisioned applications. In this context, this paper proposes a methodology for developing and deploying conversational applications on the edge on top of an open-source software and hardware infrastructure via a multilayer architecture that simplifies low-level complexity and reduces learning overhead. The proposed approach facilitates the rapid development of applications by third-party developers, thereby enabling the establishment of a marketplace of customized applications aimed at the smart assisted living domain, among others. The supporting framework supports application developers, device owners, and ecosystem administrators in building, testing, uploading, and deploying applications, remotely controlling devices, and monitoring device performance. A demonstration of this methodology is presented and discussed focusing on health and assisted living applications for the elderly.

Read full abstract

Multimodal Interaction Research Articles

Related Topics

Articles published on Multimodal Interaction

Fish behavior recognition based on an audio-visual multimodal interactive fusion network

Global-aware Fragment Representation Aggregation Network for image-text retrieval

Virtual Obstacle Avoidance Strategy: Navigating through a Complex Environment While Interacting with Virtual and Physical Elements.

The complexity of interpersonal physiology during rupture and repair episodes in the treatment of borderline personality disorder: a proof-of-concept multimethod single case study of verbal and non-verbal interactional dynamics.

Referring Image Segmentation with Multi-Modal Feature Interaction and Alignment Based on Convolutional Nonlinear Spiking Neural Membrane Systems.

Multilevel attention imitation knowledge distillation for RGB-thermal transmission line detection

Specific Emitter Identification Algorithm Based on Time–Frequency Sequence Multimodal Feature Fusion Network

Zero-shot visual grounding via coarse-to-fine representation learning

NARRATIVE MULTIMODAL TEXT AS A TYPE OF MULTIMODAL TEXT

Robust Stochastic Model Predictive Control for Autonomous Vehicle Motion Planning

Learnable and Expressive Visualization Authoring through Blended Interfaces.

Effects of Self-Learning and Exploration for XR-based Interactions

Robot Control Platform for Multimodal Interactions with Humans Based on ChatGPT

Translanguaging space through pointing gestures: Multilingual family literacy at a science museum

Who are the best contributors? Designing a multimodal science communication interface based on the ECM, TAM and the Taguchi methods

A tactile oral pad based on carbon nanotubes for multimodal haptic interaction

MRehab: A Mixed Reality Rehabilitation System Supporting Integrated Speech and Hand Training

Building and coordinating multimodal interaction with gaze across two communicative situations

Analysing educational dialogue around shared artefacts in technology-mediated contexts: a new coding framework

A Multilayer Architecture towards the Development and Distribution of Multimodal Interface Applications on the Edge.

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Multimodal Interaction Research Articles

Related Topics

Articles published on Multimodal Interaction

Fish behavior recognition based on an audio-visual multimodal interactive fusion network

Global-aware Fragment Representation Aggregation Network for image-text retrieval

Virtual Obstacle Avoidance Strategy: Navigating through a Complex Environment While Interacting with Virtual and Physical Elements.

The complexity of interpersonal physiology during rupture and repair episodes in the treatment of borderline personality disorder: a proof-of-concept multimethod single case study of verbal and non-verbal interactional dynamics.

Referring Image Segmentation with Multi-Modal Feature Interaction and Alignment Based on Convolutional Nonlinear Spiking Neural Membrane Systems.

Multilevel attention imitation knowledge distillation for RGB-thermal transmission line detection

Specific Emitter Identification Algorithm Based on Time–Frequency Sequence Multimodal Feature Fusion Network

Zero-shot visual grounding via coarse-to-fine representation learning

NARRATIVE MULTIMODAL TEXT AS A TYPE OF MULTIMODAL TEXT

Robust Stochastic Model Predictive Control for Autonomous Vehicle Motion Planning

Learnable and Expressive Visualization Authoring through Blended Interfaces.

Effects of Self-Learning and Exploration for XR-based Interactions

Robot Control Platform for Multimodal Interactions with Humans Based on ChatGPT

Translanguaging space through pointing gestures: Multilingual family literacy at a science museum

Who are the best contributors? Designing a multimodal science communication interface based on the ECM, TAM and the Taguchi methods

A tactile oral pad based on carbon nanotubes for multimodal haptic interaction

MRehab: A Mixed Reality Rehabilitation System Supporting Integrated Speech and Hand Training

Building and coordinating multimodal interaction with gaze across two communicative situations

Analysing educational dialogue around shared artefacts in technology-mediated contexts: a new coding framework

A Multilayer Architecture towards the Development and Distribution of Multimodal Interface Applications on the Edge.