Q-BENCH: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs.

Zicheng Zhang,Haoning Wu,Erli Zhang,Guangtao Zhai,Weisi Lin

doi:10.1109/tpami.2024.3445770

Abstract

The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in low-level visual perception and understanding remains a yet-to-explore domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: the low-level visual perception (A1) via visual question answering related to low-level attributes (e.g. clarity, lighting); and the low-level visual description (A2), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related questionanswering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we carry out the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting score, by employing a softmax-based approach to enable all MLLMs to generate quantifiable quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. Datasets will be available at https://github.com/Q-Future/Q-Bench.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Q-BENCH: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs.

Abstract

Talk to us

Similar Papers

More From: IEEE transactions on pattern analysis and machine intelligence

Lead the way for us

Similar Papers

Depth-aware salient object segmentation
Le Vu Ha ... Tran Hoang Tung
VNU Journal of Science: Computer Science and Communication Engineering | VOL. 36
Le Vu Ha, et. al.Le Vu Ha ... Tran Hoang Tung
07 Oct 2020
VNU Journal of Science: Computer Science and Communication Engineering | VOL. 36

Direct estimation of regional lung volume change from paired and single CT images using residual regression neural network.
Sarah E Gerard ... Eric A Hoffman
Medical Physics | VOL. 50
Sarah E Gerard, et. al.Sarah E Gerard ... Eric A Hoffman
26 Mar 2023
Medical Physics | VOL. 50

Active Sampling Exploiting Reliable Informativeness for Subjective Image Quality Assessment Based on Pairwise Comparison
Zhiwei Fan ... Tiejun Huang
IEEE Transactions on Multimedia | VOL. 19
Zhiwei Fan, et. al.Zhiwei Fan ... Tiejun Huang
01 Dec 2017
IEEE Transactions on Multimedia | VOL. 19

A Perceptually Weighted Rank Correlation Indicator for Objective Image Quality Assessment.
Qingbo Wu ... King N Ngan
IEEE Transactions on Image Processing | VOL. 27
Qingbo Wu, et. al.Qingbo Wu ... King N Ngan
15 May 2017
IEEE Transactions on Image Processing | VOL. 27

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Q-BENCH: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs.

Abstract

Talk to us

Similar Papers

More From: IEEE transactions on pattern analysis and machine intelligence