  • New
  • Research Article
  • 10.1145/3806389
Diff-Oracle: Learning Styles and Contents to Augment Realistic Oracle Characters in Diffusion Model
  • Apr 28, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Jing Li + 5 more

Recognizing oracle bone scripts plays an important role in Chinese archaeology and philology. However, a significant challenge remains because of the scarcity of oracle character images. To overcome this issue, we propose Diff-Oracle, a novel multi-modal conditional diffusion model that generates a diverse range of controllable oracle characters by inputting random combinations of references. Given the challenge of accurately describing oracle character styles using natural language, Diff-Oracle departs from traditional diffusion models that rely primarily on text prompts by introducing a style encoder. This encoder extracts style prompts from existing oracle character images, where style details are converted into a text embedding format via a pre-trained language-vision model. Additionally, given the lack of explicit content information for oracle characters, ensuring that generated characters accurately represent the intended glyphs is challenging. Therefore, we pre-generate pixel-level paired oracle character images (i.e., style and content images) by an image-to-image translation model, providing content information for the generation process. Meanwhile, Diff-Oracle integrates a content encoder designed to capture specific content details from content reference images. Extensive experiments on Oracle-241 and OBC306 datasets demonstrate that Diff-Oracle significantly outperforms existing generative methods in image quality and diversity. Moreover, Diff-Oracle substantially benefits downstream recognition tasks, outperforming all existing state-of-the-art methods by a large margin. In particular, on the challenging OBC306 dataset, Diff-Oracle achieves a 7.70% accuracy gain in the zero-shot setting and reaches 84.62% accuracy for unseen oracle characters, setting a new benchmark for oracle character recognition. The code is available at https://github.com/JJJingLi/Diff-Oracle.

  • New
  • Research Article
  • 10.1145/3810953
Lingo2Action: Fusing Semantic Risk and Perceptual Uncertainty for Adaptive 3D Value Maps
  • Apr 28, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Zhi Wang + 7 more

While LLMs such as GPT, LLaMA, and Gemini have achieved significant progress in language understanding and reasoning, their capability to act reliably within the physical world remains constrained. Existing approaches predominantly rely on static spatial representations, which fail to capture the semantic risks embedded within language instructions and the perceptual uncertainties inherent in real-world sensing. To address these limitations, this work introduces Lingo2Action, a unified semantic-driven robotic framework based on the LAM paradigm. This framework incorporates a Semantic-Perceptual Adaptive Field (SPAF) that dynamically integrates Semantic-Aware Parameterization (SAP) derived from the LLM with Risk-Aware Perception (RAP) generated by the perception module. Through this integration, the system jointly encodes task semantics, environmental risks, and perceptual confidence into an Adaptive Semantic Value Map (ASVM). The ASVM directs an enhanced context-aware path planner to generate safe and efficient trajectories. Empirical evaluations demonstrate that, within tabletop manipulation scenarios characterized by varying semantic risks and perceptual uncertainties, the Lingo2Action framework equipped with the SPAF engine consistently improves both operational safety and task success rates.

  • New
  • Research Article
  • 10.1145/3811910
Relation-Aware Proxy Hashing for Cross-Modal Retrieval
  • Apr 28, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Jinyu Xu + 5 more

Proxy hashing methods have attracted increasing attention in cross-modal retrieval, because they are able to learn the mapping of different modalities into a common low-dimensional hash space by leveraging global proxies. However, existing approaches typically suffer from two limitations: (1) They solely utilize prior labels to capture the global semantic information, and hence lack the ability to explore necessary fine-grained semantic information to bridge the modality gap effectively. (2) They seldom consider the guidance information of intrinsic semantic similarity on proxy-centered space, and thus fail to leverage the similarity relations among instances sufficiently. To mitigate these limitations, we propose RAPH, a novel Relation-Aware Proxy Hashing framework that learns the semantic relations between different modalities and different semantic levels to enhance the discriminative capability of hash codes. Specifically, we first propose a Local Semantic Interaction (LSI) module based on masked language modeling to achieve the interaction of multi-modal fine-grained semantic features. Secondly, a relation-aware hashing learning scheme is designed to simultaneously explore the intrinsic semantic relationships and the global semantic information based on proxies. This is achieved by minimizing the reconstruction error between the multi-modal affinity matrices derived from learned features and the cross-modal similarity matrix of the hash codes. The proposed framework is able to learn more discriminative hash codes and achieves superior performance to many baselines on three public datasets.
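
The reconstruction objective mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact form of the matrices, so the cosine affinity, the scaled inner product of {-1, +1} codes, and the squared Frobenius distance used here are assumptions chosen for illustration, and all function names are hypothetical rather than taken from RAPH.

```python
import math

def cosine_affinity(features):
    """Pairwise cosine similarity between feature rows (assumed affinity)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in features]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(features, norms)]
        for u, nu in zip(features, norms)
    ]

def code_similarity(img_codes, txt_codes):
    """Cross-modal similarity of {-1, +1} hash codes, scaled to [-1, 1]."""
    k = len(img_codes[0])
    return [[sum(a * b for a, b in zip(u, v)) / k for v in txt_codes]
            for u in img_codes]

def reconstruction_error(affinity, similarity):
    """Squared Frobenius distance between the two matrices."""
    return sum((a - s) ** 2
               for row_a, row_s in zip(affinity, similarity)
               for a, s in zip(row_a, row_s))

# Toy data: two instances with orthogonal features.
features = [[1.0, 0.0], [0.0, 1.0]]
img_codes = [[1, 1, -1, -1], [1, -1, 1, -1]]
good_txt_codes = [[1, 1, -1, -1], [1, -1, 1, -1]]  # mirrors the feature affinity
bad_txt_codes = [[1, 1, -1, -1], [1, 1, -1, -1]]   # collapses both instances

affinity = cosine_affinity(features)
good_error = reconstruction_error(affinity, code_similarity(img_codes, good_txt_codes))
bad_error = reconstruction_error(affinity, code_similarity(img_codes, bad_txt_codes))
```

In this toy setting, codes whose similarity structure matches the feature affinity give a lower error than codes that collapse distinct instances, which is the behavior such a loss is meant to reward.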

  • New
  • Research Article
  • 10.1145/3797274
Learning to Discern Fine-Grained Cues Across Domains: Generalizing Re-ID via Multi-Level Feature Propagation
  • Apr 24, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Xu Zhang + 3 more

Domain generalizable person Re-identification methods (DG person ReID) often focus on aligning source-domain distributions to learn domain-invariant features. However, they commonly overlook the pivotal role of hard samples and subtle yet important appearance differences across domains in refining decision boundaries, leading to suboptimal performance when encountering visually similar pedestrians. To address these issues, we propose a framework that combines adaptive uncertainty-driven hard identity mining with a multi-level feature propagation strategy. Specifically, we first identify and prioritize challenging samples across domains through an uncertainty estimation mechanism coupled with dynamic cross-domain weight allocation, thereby adaptively enhancing the model’s discriminative capability for confusable identities. We then incorporate both cross-domain and intra-domain feature propagation to integrate fine-grained information across multiple source domains, strengthening the model’s adaptability to unseen target domains. Extensive experiments on nine real-world benchmarks demonstrate that our method consistently outperforms state-of-the-art DG person ReID approaches, achieving up to 5.3% mAP and 6.6% Rank-1 improvements over the leading baselines.

  • New
  • Research Article
  • 10.1145/3803012
One-shot Face Sketch Synthesis in-the-Wild via Generative Diffusion Prior and Instruction Tuning
  • Apr 21, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Han Wu + 5 more

Face sketch synthesis converts face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo–sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs. Once the training data become scarce, their generative performance significantly degrades. In this article, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo–sketch image pairs. Then, the instructions derived through gradient-based optimization are used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo–sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, and so on. For a solid out-of-distribution evaluation, we select only one pair of images for training at a time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at https://github.com/HanWu3125/OS-Sketch.
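
The one-shot evaluation protocol described in this abstract (one pair for training, the remaining pairs for inference) might be enumerated as in the sketch below. The function name is hypothetical, and the assumption that every pair takes a turn as the single training sample is ours; the abstract does not say how training pairs are chosen.

```python
def one_shot_splits(num_pairs):
    """Yield (train_index, test_indices): one training pair, the rest held out."""
    for train_idx in range(num_pairs):
        test_idx = [i for i in range(num_pairs) if i != train_idx]
        yield train_idx, test_idx

# 400 photo-sketch pairs, as stated for the OS-Sketch benchmark.
splits = list(one_shot_splits(400))
first_train, first_test = splits[0]
```

Each split trains on a single pair and evaluates on the other 399, so no test pair is ever seen during training.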

  • New
  • Research Article
  • 10.1145/3796237
CIDER: Collaborative Interactive Dynamic Environments for eXtended Reality
  • Apr 21, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Hung-Jui Guo + 3 more

Remote collaboration systems based on physical environments face several critical challenges, including data-heavy virtual representations and high latencies during data acquisition, reconstruction, rendering, and transmission. Existing approaches often suffer from significant latency, making them unsuitable for real-time collaboration, rely on static scenes that limit interaction, and require multiple specialized hardware devices, restricting accessibility. To address these challenges, we present Collaborative Interactive Dynamic Environments for eXtended Reality (CIDER), the first eXtended Reality (XR) platform to integrate Mixed Reality (MR) for co-located users and Virtual Reality (VR) for remote participants through a fully automated pipeline that replicates entire physical environments. CIDER dynamically transforms a user’s physical space into an interactive virtual environment, shareable with remote collaborators within seconds. It employs an efficient approach to represent, render, distribute, and synchronize virtual scenes, achieving interaction latencies of 0.22 seconds, about 10 times lower than comparable systems (2.4 seconds). We evaluate CIDER’s performance quantitatively with collaboration-oriented metrics in scenarios where participants are separated by up to 12,000 km. We also conduct a questionnaire-based user study with 17 participants to evaluate usability and overall user experience. Furthermore, CIDER allows collaborators to participate using a broad range of devices, including personal computers (via Unity emulators, functioning similarly to an MR/VR device), MR devices (e.g., HoloLens 2), and VR devices (e.g., Meta Quest 2 and 3), enhancing accessibility and usability for diverse user groups.

  • New
  • Research Article
  • 10.1145/3801549
PSO-HEAD: Pseudo-Supervision Guided Spatial Optimization for View-Consistent 3D Full-Head Reconstruction
  • Apr 21, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Ping Zhang + 5 more

Generative 3D human head reconstruction in \(360^{\circ}\) is attracting increasing attention because of its flexibility in downstream animation applications. Existing generative 3D head synthesis approaches are primarily limited to near-frontal face priors, which cause distorted artifacts at large view angles. In this article, we introduce a novel Pseudo-Supervision Guided Spatial Optimization (PSO-HEAD) framework that reconstructs a view-consistent 3D full head by explicitly introducing pseudo-labels as back-head supervision for spatial texture and geometric optimization. Specifically, PSO-HEAD introduces two key improvements: Pseudo-Supervision Augmented Inversion (PSA-Inversion) and Full-Head Aware Generative Enhancement (FAGE). PSA-Inversion augments the input with plausible invisible back-head views as pseudo-supervision to optimize the view-hallucinated latent code conditioned on the augmented camera poses via GAN inversion, enforcing 3D spatial consistency across both visible and invisible regions. Furthermore, FAGE fine-tunes the 3D GAN on a proposed auxiliary FK-Enhance dataset derived from either generated or real-world high-quality back-head images, which improves the generalization of PSO-HEAD to diverse hairstyles and underrepresented regions. With these improvements, PSO-HEAD enables efficient \(360^{\circ}\) view-consistent full-head generation from single input images, particularly improving reconstruction fidelity of unobserved regions, and outperforms state-of-the-art methods both quantitatively and qualitatively.

  • New
  • Research Article
  • 10.1145/3803423
Unsupervised Dehazing of Real-World Images with Frequency-Aware Learning
  • Apr 21, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Minglong Xue + 3 more

Image dehazing remains a challenging task in real-world scenarios. Unlike synthetic datasets, real-world hazy images often exhibit nonuniform atmospheric degradation, severe haze accumulation, and significant texture loss, making it difficult for existing methods to restore realistic appearances and fine details consistently. To address these challenges, we propose a frequency-aware learning-based unpaired image dehazing network for real-world hazy scenes. First, we propose a content-aware state space modeling paradigm, focus on details through high-frequency enhancement, and incorporate global contextual understanding at the encoding stage, enabling adaptive representation of complex textures. Second, to effectively handle spatially nonuniform degradation, a hazy density estimation module is designed to guide multiple expert-gated feedback units, which dynamically select feature fusion paths. Finally, we propose a contour-guided differentiable frequency domain enhancement mechanism to explicitly recover edge and texture details in degraded regions. Extensive experiments on real-world hazy datasets under unsupervised settings demonstrate that our method achieves competitive performance, validating its effectiveness and strong practical potential under complex atmospheric conditions. The code is available at https://github.com/Fan-pixel/FAL-Net .

  • New
  • Research Article
  • 10.1145/3801156
SOR-BDNet: Semantic-Optical Representation for Boundary-Aware Video Anomaly Detection with GPT-4o
  • Apr 21, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Yi Sun + 5 more

In recent years, Video Anomaly Detection (VAD) has shifted from conventional appearance-based modeling to semantically driven frameworks empowered by LLMs. Traditional reconstruction- and prediction-based methods, relying on motion or appearance patterns learned from normal data, often misclassify previously unseen yet semantically normal events as anomalies. To address this limitation, we propose SOR-BDNet (Semantic-Optical Representation with Boundary Detection Network), an annotation-free multimodal VAD framework that jointly leverages visual appearance and motion dynamics to generate interpretable semantic representations at the frame level. Specifically, we employ RAFT to estimate dense motion fields and concatenate the resulting flow maps with RGB images to form unified spatiotemporal inputs. These fused representations are fed into a GPT-4o-based module that generates semantic captions capturing object semantics and motion cues. Anomalies are detected by measuring semantic deviations from a memory bank constructed from normal captions. To further refine temporal boundaries, we design a boundary refinement module that integrates visual continuity constraints with contrastive feature learning based on a Swin Transformer backbone. Extensive experiments on four challenging benchmarks—UCSD-Ped2, Avenue, ShanghaiTech, and UCF-Crime—demonstrate that SOR-BDNet achieves frame-level accuracies of 97.96%, 82.86%, 87.36%, and 85.64%, respectively. These results highlight the robustness and scalability of the proposed framework, while significantly improving interpretability and generalization across diverse real-world surveillance scenarios. The source code and pretrained models are available at https://github.com/syi-coder/SOR-BDNet-Semantic-Optical-Representation-for-Boundary-Aware-Video-Anomaly-Detection-with-GPT-4o .
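
The memory-bank scoring step described in this abstract can be sketched as follows. The abstract does not specify how captions are embedded or which deviation measure is used, so the toy vectors, the cosine similarity, and the "one minus best match" score below are assumptions for illustration only; the function names are hypothetical and are not taken from the SOR-BDNet code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anomaly_score(caption_vec, memory_bank):
    """One minus the best cosine match against stored normal-caption vectors."""
    return 1.0 - max(cosine(caption_vec, m) for m in memory_bank)

# Toy stand-ins for embedded captions of normal frames (assumed, not real data).
memory_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
normal_like = [0.9, 0.1, 0.0]  # close to a stored normal caption
unusual = [0.0, 0.0, 1.0]      # orthogonal to everything in the bank

low_score = anomaly_score(normal_like, memory_bank)
high_score = anomaly_score(unusual, memory_bank)
```

A frame whose caption embedding lies close to any stored normal caption receives a score near zero, while a caption unlike everything in the bank scores near one; a threshold on this score would then flag anomalous frames.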

  • New
  • Research Article
  • 10.1145/3799429
Transition-aware Path and Direction Variation Modeling for Gaze Target Detection in Video
  • Apr 21, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Xingming Yang + 5 more

Gaze target detection aims to localize a person’s gaze target. During gaze transition in video, the absence of accurate temporal variation modeling (TVM) may lead to errors in gaze target localization. In this work, we propose a Transition-aware Gaze Model (TGM), which focuses on analyzing temporal differences to achieve accurate location variation modeling. The TGM contains four key components: a frame gaze model and three transition-aware modules (path variation, direction variation, and fusion). First, the frame Transformer extracts gaze location and direction features. Second, to analyze the feature difference among transition frames, we introduce TVM guided by transition-aware loss. TVM analyzes the location features to capture the moving trajectory of targets (defined as path variation), which facilitates the search for target locations near the path. Third, TVM also analyzes the direction features to capture the transition-aware direction area (defined as direction variation), which facilitates the search for target locations within this area. Fourth, since gaze directions dynamically adjust to track gaze targets, path variation and direction variation are inherently aligned with the natural movement of a person’s gaze. Thus, these two variations are fused into a unified transition-aware feature, which helps cover all potential target locations. To search for accurate target locations, we embed this transition-aware feature into frame features with cross-attention, which can enhance gaze target detection in transition frames. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two datasets, namely VideoAttentionTarget and VideoCoAtt.