Vision-language Tasks Research Articles

Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE) require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to be optimized in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating Inner Monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. More specifically, we enable LLMs and VLMs to interact through natural language conversation (i.e., Inner Monologue) and propose to use a two-stage training process to learn how to do Inner Monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and achieves competitive performance with less training data when compared with state-of-the-art models while concurrently keeping the interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, broadening its potential applications across various AI challenges beyond vision and language tasks.

The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, such as LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes, suffering from understanding image details. A practical solution to this problem would be to utilize the available multimodal large language models to generate instruction data for vision-language tasks. However, it's worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. As a solution for addressing the current issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework that enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC

Vision-language Tasks Research Articles

Articles published on Vision-language Tasks

A fine-grained deconfounding study for knowledge-based visual dialog

GEXMERT: Geometrically enhanced cross-modality encoder representations from transformers inspired by higher-order visual percepts

Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.

Vision-language pre-training via modal interaction

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

X 2-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks.

Adventures of Trustworthy Vision-Language Models: A Survey

Continual Vision-Language Retrieval via Dynamic Knowledge Rectification

Tackling Vision Language Tasks through Learning Inner Monologues

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network

GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

VIGC: Visual Instruction Generation and Correction

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

Enhancing multi-modal fusion in visual dialog via sample debiasing and feature interaction

Self-Paced Multi-Grained Cross-Modal Interaction Modeling for Referring Expression Comprehension.

Toward Robust Referring Image Segmentation.

Zero-shot Scene Graph Generation via Triplet Calibration and Reduction

BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning

Heterogeneous Knowledge Network for Visual Dialog

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Vision-language Tasks Research Articles

Articles published on Vision-language Tasks

A fine-grained deconfounding study for knowledge-based visual dialog

GEXMERT: Geometrically enhanced cross-modality encoder representations from transformers inspired by higher-order visual percepts

Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.

Vision-language pre-training via modal interaction

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

X 2-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks.

Adventures of Trustworthy Vision-Language Models: A Survey

Continual Vision-Language Retrieval via Dynamic Knowledge Rectification

Tackling Vision Language Tasks through Learning Inner Monologues

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network

GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

VIGC: Visual Instruction Generation and Correction

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

Enhancing multi-modal fusion in visual dialog via sample debiasing and feature interaction

Self-Paced Multi-Grained Cross-Modal Interaction Modeling for Referring Expression Comprehension.

Toward Robust Referring Image Segmentation.

Zero-shot Scene Graph Generation via Triplet Calibration and Reduction

BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning

Heterogeneous Knowledge Network for Visual Dialog