Multimodal tasks have become a prominent research direction in recent years. The emergence of large-scale models has steadily advanced a wide range of multimodal tasks, yielding remarkable results. However, how to effectively fuse features from multiple modalities remains an open problem. In tasks such as sentiment analysis over diverse social media content, relying solely on features derived from the [CLS] token may provide insufficient information. This paper proposes the BVA-Transformer, a model architecture for image-text multimodal classification and dialogue, which incorporates the EF-CaTrBERT method for feature fusion and introduces BLIP to map images into the textual space. This allows images and text to be fused within the same information space, avoiding the information redundancy and conflict that arise in traditional feature fusion methods. In addition, we propose a visual-attention-based Global Features Encoder (GFE) module in the BVA-Transformer, which provides more global and targeted auxiliary features for the [CLS] token. This enables the model to exploit richer feature information in classification tasks under this fusion scheme and to dynamically select the information to attend to. We also introduce the Trv structure from EVA-02 into the decoder of the BVA-Transformer and investigate its impact on model performance. Furthermore, we design a three-stage training strategy to further enhance the model's performance. Experimental results demonstrate that the BVA-Transformer achieves high-quality classification while generating dialogue sentences, and it achieves excellent performance on our validation dataset compared with existing multimodal classification models.
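For intuition only, the sketch below illustrates the general idea of augmenting the [CLS] token with attention-pooled global visual features before classification. It is a minimal, hypothetical example, not the paper's actual GFE implementation; the module name, dimensions, and fusion by concatenation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GlobalFeaturesEncoder(nn.Module):
    """Hypothetical sketch: attention-pool patch-level visual features into a
    single global auxiliary vector and fuse it with the text-side [CLS] embedding.
    Names and dimensions are illustrative, not the paper's exact design."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))      # learnable global query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)                    # fuse [CLS] + global visual feature

    def forward(self, cls_token: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, dim); visual_tokens: (B, N, dim) patch features
        q = self.query.expand(visual_tokens.size(0), -1, -1)
        global_feat, _ = self.attn(q, visual_tokens, visual_tokens)   # (B, 1, dim)
        fused = torch.cat([cls_token, global_feat.squeeze(1)], dim=-1)
        return self.proj(fused)                                       # (B, dim), fed to the classifier head
```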