  • New
  • Research Article
  • 10.1109/tpami.2026.3661049
Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization.
  • Feb 4, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • De Cheng + 5 more

Domain Generalization (DG) seeks to develop models that perform well on unseen target domains by learning domain-invariant representations. Recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have shown strong potential for enhancing DG through prompt tuning. However, existing VFM-based prompt tuning methods often focus on task-specific adaptation rather than disentangling domain-invariant features, leaving cross-domain generalization insufficiently explored. In this paper, we address this challenge by fully leveraging the controllable and flexible language prompt in VFMs. Observing that the text modality is inherently rich in semantics and easier to disentangle, we propose a novel framework termed Prompt Disentanglement via Language Guidance and Representation Alignment (PADG). PADG first employs a large language model (LLM) to disentangle textual prompts into domain-invariant and domain-specific components, which then guide the learning of domain-invariant visual representations. To complement the limitations of text-only guidance, we further introduce the Worst Explicit Representation Alignment (WERA) module, which enhances visual invariance by simulating bounded domain shifts through learnable stylization prompts and aligning representations between original and perturbed samples. Extensive experiments on mainstream DG benchmarks, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that PADG consistently outperforms existing state-of-the-art methods, validating its effectiveness in robust domain-invariant representation learning. The code is available at: https://anonymous.4open.science/r/paper-5403/.

  • New
  • Research Article
  • 10.1109/tpami.2026.3656175
Evaluating and Mitigating Relationship Hallucinations in Large Vision-Language Models.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Mingrui Wu + 5 more

The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark specifically designed to evaluate hallucinations in visual relationships. R-Bench includes both image-level questions to assess the existence of relationships and instance-level questions that probe deeper into local visual comprehension. Our analysis reveals that relationship hallucinations arise from three types of co-occurrences: relationship-relationship, subject-relationship, and relationship-object, exacerbated by the long-tail distribution in visual datasets. Moreover, LVLMs often ignore visual content, over-relying on common sense from language models, particularly in spatial reasoning tasks. We further demonstrate that region-level image-text alignment helps mitigate relationship hallucinations and propose a new baseline, Region-Aware Alignment Mitigation (RA2M), that enhances model attention to relevant regions, improving alignment between generated text and images.

  • New
  • Research Article
  • 10.1109/tpami.2026.3660754
Decoupled Hierarchical Distillation for Multimodal Emotion Recognition.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Yong Li + 5 more

Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3%/2.4% (ACC$_{7}$), 1.3%/1.9% (ACC$_{2}$) and 1.9%/1.8% (F1) relative improvement on the CMU-MOSI/CMU-MOSEI datasets, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.

  • New
  • Research Article
  • 10.1109/tpami.2026.3660569
An Algebraic Geometry Approach to Viewing Graph Solvability.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Federica Arrigoni + 3 more

The concept of viewing graph solvability has gained significant interest in the context of structure-from-motion. A viewing graph is a mathematical structure where nodes are associated with cameras and edges represent the epipolar geometry connecting overlapping views. Solvability studies under which conditions the cameras are uniquely determined by the graph. In this paper we propose a novel framework for analyzing solvability problems based on algebraic geometry, demonstrating its potential in understanding structure-from-motion graphs and proving a conjecture that was previously proposed.

  • New
  • Research Article
  • 10.1109/tpami.2026.3660934
Full-Scope Vectorization of Geographical Elements from Large-Size Remote Sensing Imagery.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Yansheng Li + 7 more

Large-size very-high-resolution (VHR) remote sensing imagery has emerged as a critical data source for high-precision vector mapping of multi-scale geographical elements such as buildings, water, and roads. When dealing with large-size images, deep learning-based vector mapping methods often employ a sliding-block strategy due to the limited memory of GPUs. This inevitably degrades performance because of the difficulty of stitching the vector mapping results of the sliding blocks. Therefore, it is necessary to conduct full-scope vector mapping by mining the consistency cues in large-size remote sensing imagery. To this end, this paper presents a novel global context-aware local point optimization method. To leverage the global context, this paper proposes a novel pyramid fusion network (PFNet) to conduct semantic segmentation of the large-size image in an end-to-end manner. Under the constraint of the global semantic segmentation result, a new inflection-point perception network (IPNet) is proposed to generate a set of stable points to depict the boundary of each element. Extensive experiments on building, water, and road datasets, where each image has over 100 million pixels, show that our method clearly outperforms existing methods. The project page is at https://li-99.github.io/project/Vectorization.html.

  • New
  • Research Article
  • 10.1109/tpami.2026.3660366
Top-$k$ Feature Selection in Sparse Learning via Accelerated Coordinate Descent Method.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Han Zhang + 3 more

Top-$k$ feature selection in sparse learning is a fundamental problem in machine learning. It is difficult to solve due to the rigid $\ell _{2,0}$-norm constraint. Existing literature mostly relaxes the constraint and seeks an approximation of the selection matrix, degrading the original models and missing the genuine solutions. This research tackles the original top-$k$ feature selection model in sparse learning. For generality, we investigate both supervised and semi-supervised models of top-$k$ feature selection in sparse learning. By disassembling the feature selection matrix, we reveal that the two different objectives can be unified into one general ratio-trace problem, which is a non-convex optimization problem. An accelerated coordinate descent method is proposed to efficiently solve the non-convex objective, through which a locally optimal solution for the top-$k$ feature indices is obtained at a competitive time cost. To verify the proposed algorithm, we design toy experiments that visualize the advantages of the selected features. Meanwhile, experimental results on nine standard datasets and the large-scale ImageNet dataset comprehensively show the superiority of our methods compared to representative and state-of-the-art supervised and semi-supervised algorithms.
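The rigid $\ell_{2,0}$-norm constraint in the abstract above caps the number of nonzero rows in the selection matrix. A minimal sketch of what hard top-$k$ selection via row norms means (a generic hard-thresholding step for illustration, not the paper's accelerated coordinate descent solver):

```python
import numpy as np

def topk_feature_indices(W, k):
    """Keep the k rows of the projection matrix W with the largest
    l2 norms; all other rows are implicitly zeroed, satisfying the
    l2,0 constraint ||W||_{2,0} <= k. Illustrative only."""
    row_norms = np.linalg.norm(W, axis=1)      # one l2 norm per feature row
    return np.sort(np.argsort(row_norms)[-k:]) # indices of the k largest

# toy usage: 4 features projected to 2 dimensions, select 2 features
W = np.array([[3.0, 0.0],
              [0.0, 0.1],
              [1.0, 1.0],
              [0.0, 2.0]])
selected = topk_feature_indices(W, 2)          # rows 0 and 3 dominate
```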

  • New
  • Research Article
  • 10.1109/tpami.2026.3660699
Generalized Regularized Evidential Deep Learning Models: Theory and Comprehensive Evaluation.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Deep Shankar Pandey + 2 more

Evidential deep learning (EDL) models, based on Subjective Logic, introduce a principled and computationally efficient way to make deterministic neural networks uncertainty-aware. The resulting evidential models can quantify fine-grained uncertainty using learned evidence. However, the Subjective Logic framework constrains evidence to be non-negative, requiring specific activation functions whose geometric properties can induce activation-dependent learning-freeze behavior: a regime where gradients become extremely small for samples mapped into low-evidence regions. We theoretically characterize this behavior and analyze how different evidential activations influence learning dynamics. Building on this analysis, we design a general family of activation functions and corresponding evidential regularizers that provide an alternative pathway for consistent evidence updates across activation regimes. Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet), two few-shot classification problems, and a blind face restoration problem empirically validate the developed theory and demonstrate the effectiveness of the proposed generalized regularized evidential models.
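The non-negativity constraint and the low-evidence regime described above can be sketched with the standard EDL head (evidence mapped to Dirichlet parameters $\alpha = e + 1$, vacuity $K/\sum\alpha$). This illustrates the baseline setup the paper generalizes, not its proposed activation family; with a ReLU activation, a sample whose logits are all negative gets zero evidence, the regime where the learning freeze occurs.

```python
import numpy as np

def edl_uncertainty(logits, activation="softplus"):
    """Standard evidential head (baseline EDL formulation, not the
    paper's generalized family): map logits to non-negative evidence,
    form Dirichlet parameters alpha = evidence + 1, and return the
    expected class probabilities plus the vacuity uncertainty."""
    if activation == "softplus":
        evidence = np.log1p(np.exp(logits))   # smooth, strictly positive
    elif activation == "relu":
        evidence = np.maximum(logits, 0.0)    # zeroes all negative logits
    else:
        raise ValueError(f"unknown activation: {activation}")
    alpha = evidence + 1.0
    strength = alpha.sum()
    probs = alpha / strength                  # expected probabilities
    uncertainty = len(alpha) / strength       # vacuity: K / sum(alpha)
    return probs, uncertainty

# all-negative logits under ReLU: zero evidence, maximal uncertainty,
# and (in training) vanishing gradients -- the "learning freeze" regime
probs, u = edl_uncertainty(np.array([-2.0, -3.0, -1.0]), activation="relu")
```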

  • New
  • Research Article
  • 10.1109/tpami.2026.3660863
Allies Teach Better than Enemies: Inverse Adversaries for Robust Knowledge Distillation.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Junhao Dong + 3 more

Adversarially robust knowledge distillation aims to compress a large-scale robust teacher model into a lightweight student counterpart while preserving adversarial robustness and natural performance. Previous methods primarily focused on aligning knowledge (e.g., predictions) between teacher and student models to transfer robustness. However, potentially incorrect predictions from the teacher can misguide the student, negatively impacting robustness transfer. To circumvent this, we propose a novel adversarially robust knowledge distillation scheme that promotes alignment towards more benign predictions rather than incorrect ones by refining inputs into so-called "inverse adversarial examples" via simply reversing the sign of adversarial perturbation. Through a comprehensive investigation of the properties of inverse adversaries, we provide new theoretical insights showing how mimicking the behavior of the teacher model on inverse adversaries facilitates reliable robustness transfer built upon the implicit connection between robustness and the input gradient information. We thus design a gradient matching mechanism between teacher and student models utilizing inverse adversaries to facilitate robust knowledge alignment. Furthermore, inspired by our analysis of the correlation between robustness and adversarial transferability, we propose a weight-space disruption strategy that jointly interacts with both teacher and student models to find a shared direction for better robustness transfer. Empirical evaluations across various datasets demonstrate that our method achieves state-of-the-art robustness and natural performance. Notably, on ImageNet, our approach outperforms prior methods by approximately 3.8% in both clean and robust accuracy. Moreover, we show that incorporating auxiliary generated data into distillation further boosts robustness. Our method can also be generalized to multimodal architectures.
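The "inverse adversarial example" construction described above, obtained by simply reversing the sign of the adversarial perturbation, can be sketched with a single FGSM-style step on a precomputed loss gradient (function names are illustrative; the clipping range assumes inputs normalized to [0, 1]):

```python
import numpy as np

def fgsm_perturbation(grad, epsilon):
    """Sign-based (FGSM-style) perturbation from a loss gradient."""
    return epsilon * np.sign(grad)

def inverse_adversarial_example(x, grad, epsilon):
    """Adding the perturbation increases the loss (standard adversary);
    subtracting it moves the input toward a lower-loss, more benign
    region -- the 'inverse adversary' used as a distillation target."""
    delta = fgsm_perturbation(grad, epsilon)
    x_adv = np.clip(x + delta, 0.0, 1.0)   # standard adversarial example
    x_inv = np.clip(x - delta, 0.0, 1.0)   # sign-reversed counterpart
    return x_adv, x_inv

# toy usage on a 4-pixel "image" with a precomputed gradient
x = np.full(4, 0.5)
grad = np.array([1.0, -1.0, 2.0, -0.5])
x_adv, x_inv = inverse_adversarial_example(x, grad, epsilon=0.1)
```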

  • New
  • Research Article
  • 10.1109/tpami.2026.3653806
Searching to Modulate for Cold-Start Recommendation.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Shiguang Wu + 2 more

Making personalized recommendations for cold-start users, who have only a few interaction histories, is a challenging problem in recommendation systems. Recent works leverage hypernetworks to directly map interaction histories to user-specific parameters, which are then used to modulate the predictor via a certain modulation structure, and obtain state-of-the-art performance. However, a general approach to designing the modulation structure is lacking. Instead of using a fixed modulation function and deciding the modulation position by expertise, we propose to determine a proper modulation structure, including function and position, via neural architecture search. We propose two approaches. We first design a symbolic search space that covers a broad range of models and theoretically prove that this search space can be transformed into a much smaller one, enabling an efficient and robust one-shot search algorithm, called ColdNAS. Since recommendation systems are a special case of bipartite matching problems, the proposed methods can be generalized to a wide range of cold-start tasks, such as disease-gene association prediction for emerging diseases. However, diverse scenarios introduce new challenges in both the flexibility of the search algorithm and the search space. To address these limitations, we further propose ColdNAS$_+$, where we employ neural networks to model modulation functions to extend the search space and design a two-stage decoupled stochastic search algorithm to enable non-differentiable targets in continuous spaces. Extensive experimental results on benchmark datasets show that modulation structures obtained by ColdNAS and ColdNAS$_+$ consistently outperform hand-designed cold-start techniques for recommending items to new users and predicting associated genes for new diseases. We observe that different modulation functions lead to the best performance on different datasets or under different metrics, which validates the necessity of designing the modulation structure in a data-driven way.
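As a concrete illustration of hypernetwork-based modulation, a FiLM-style scale-and-shift is one plausible modulation function such a search space could contain (all names here are illustrative, not ColdNAS's API): the hypernetwork maps a user's interaction-history embedding to per-user parameters that modulate a hidden layer of the predictor.

```python
import numpy as np

def hypernet_modulation(history_emb, W_gamma, W_beta, hidden):
    """Sketch of one candidate modulation function: a tiny linear
    hypernetwork produces a user-specific scale (gamma) and shift
    (beta) from the history embedding, then applies them FiLM-style
    to the predictor's hidden activations."""
    gamma = history_emb @ W_gamma   # user-specific scale
    beta = history_emb @ W_beta     # user-specific shift
    return gamma * hidden + beta    # modulated activations

# toy usage: 8-dim pooled interaction history, 4-dim hidden layer
rng = np.random.default_rng(0)
d_hist, d_hidden = 8, 4
history_emb = rng.normal(size=d_hist)
W_gamma = rng.normal(size=(d_hist, d_hidden))
W_beta = rng.normal(size=(d_hist, d_hidden))
hidden = rng.normal(size=d_hidden)
out = hypernet_modulation(history_emb, W_gamma, W_beta, hidden)
```

A search over modulation structures would choose both the function (scale-and-shift here, but e.g. shift-only or concatenation elsewhere) and the layer at which it is applied.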

  • New
  • Research Article
  • 10.1109/tpami.2026.3660922
Towards Real-world Holistic Privacy-Preserving Person Re-identification.
  • Feb 3, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Qianxiang Meng + 3 more

Real-world person re-identification (Re-ID) systems are susceptible to malicious attacks, leading to the leakage of pedestrian images and the Re-ID model and posing severe threats to the privacy of both system owners and pedestrians. Existing privacy-preserving person re-identification (PPPR) methods fail to simultaneously resist data leakage, model leakage, and joint data & model leakage without compromising the normal functionality of Re-ID systems. In this paper, we begin with an in-depth analysis of prior methodologies and identify the gap between existing works and the ideal PPPR paradigm. Inspired by the concept of 'Let the invisible perturbation become the system trigger', we propose SHIELD, a pioneering and comprehensive two-stage privacy-preserving framework. To resist data leakage, we propose a self-supervised method for Protected Dataset Generation in the first stage, which obviates the dependence on identity labels and ensures image quality. To resist model leakage without compromising normal retrieval accuracy, we propose Original Feature Deconstruction and Protected Feature Alignment to train the system model with paired protected and original images. Extensive experiments substantiate that SHIELD significantly outperforms existing PPPR methods, offering robust and holistic protection for Re-ID systems while maintaining decent retrieval accuracy for authorized users. The code will be released soon.