Visual Instances Research Articles

Generalized zero-shot learning (GZSL) aims to recognize unseen categories using knowledge from the seen domain, which requires intrinsic interactions between visual features and attribute semantic features. However, GZSL suffers from insufficient visual-semantic correspondences due to attribute diversity and instance diversity. Attribute diversity refers to the varying semantic granularity of attribute descriptions, which range from low-level (specific, directly observable) to high-level (abstract, highly generic) characteristics. This diversity makes it difficult to collect adequate visual cues for attributes at a single granularity. Additionally, diverse visual instances that correspond to the same shared attributes introduce semantic ambiguity, leading to vague visual patterns. To tackle these problems, we propose a multi-granularity progressive semantic-visual mutual adaption (PSVMA+) network, which gathers sufficient visual elements across granularity levels to remedy the granularity inconsistency. PSVMA+ explores semantic-visual interactions at different granularity levels, enabling awareness of multi-granularity in both visual and semantic elements. At each granularity level, the dual semantic-visual transformer module (DSVTM) recasts the shared attributes into instance-centric attributes and aggregates the semantic-related visual regions, thereby learning unambiguous visual features that accommodate various instances. Because different granularities contribute differently, PSVMA+ employs selective cross-granularity learning to leverage knowledge from reliable granularities and adaptively fuses multi-granularity features into comprehensive representations. Experimental results demonstrate that PSVMA+ consistently outperforms state-of-the-art methods.

Zero-shot learning (ZSL) aims to recognize objects of classes for which no visual instances have been seen, by learning to transfer knowledge between seen and unseen classes. Attributes, which denote high-level visual entities or visual characteristics, have been widely used as an intermediate embedding space for knowledge transfer in the majority of existing ZSL approaches and have shown impressive performance. However, providing attribute annotations for unseen classes at test time is time-consuming and labor-intensive. Besides, directly using attributes as the intermediate embedding space for knowledge transfer and zero-shot prediction inevitably leads to the projection domain shift and hubness problems. In this paper, we propose a novel multi-view deep neural network, termed Fusion by Synthesis (FS), which leverages word embeddings of classes as a complement to attributes and performs zero-shot prediction by fusing the word embeddings of unseen classes and the synthesized attributes in the visual feature space. Specifically, in the training phase, by treating the visual features, attributes and word embeddings as three different views of visual instances, FS assigns each view a denoising auto-encoder to simultaneously ensure robust view-specific reconstruction and cross-view synthesis, while preserving the discrimination of class labels. During testing, FS can synthesize the absent attributes for unseen classes and fuse them with word embeddings in the visual feature space to perform zero-shot prediction. FS can also learn from partial views, where either the attribute view or the word embedding view is missing during training, and can synthesize the missing view (attributes or word embeddings) from the one provided. Extensive experiments on six benchmark datasets covering both image classification and action recognition show that FS effectively fuses multi-view data by synthesis and achieves superior performance compared with state-of-the-art ZSL methods.

Articles published on Visual Instances

Enhancing Sound Source Localization via False Negative Elimination

Region-Focused Network for Dense Captioning

Text-Vision Relationship Alignment for Referring Image Segmentation

“SO FUCKING CUTE IM DYING”: The use of hyperbole on X

PSVMA+: Exploring Multi-Granularity Semantic-Visual Adaption for Generalized Zero-Shot Learning

Ship-Go: SAR Ship Images Inpainting via instance-to-image Generative Diffusion Models

A two-stage denoising framework for zero-shot learning with noisy labels

Progressive Instance-Aware Feature Learning for Compositional Action Recognition

Dual-Spatial Normalized Transformer for image captioning

Learning semantic ambiguities for zero-shot learning

A depth information aided real-time instance segmentation method for space task scenarios under CPU platform

Frontispiece: Visual instance segmentation results of T‐element roofs on the testing dataset of RoofNTNU

OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

Two‐stage partial image‐text clustering (TPIT‐C)

Learning Social Relationship From Videos via Pre-Trained Multimodal Transformer

A Panoramic Localizer Based on Coarse-to-Fine Descriptors for Navigation Assistance

Deep Neighborhood Component Analysis for Visual Similarity Modeling

Visual sentiment analysis via deep multiple clustered instance learning

Structured Two-Stream Attention Network for Video Question Answering

Fusion by synthesizing: A multi-view deep neural network for zero-shot recognition

