Articles published on Video Object
1684 Search results
- Research Article
- 10.1016/j.dsp.2026.106006
- May 1, 2026
- Digital Signal Processing
- Tingting Yao + 5 more
OVSMMFA-Net: An object variation sensitive and multi-direction mamba based feature aggregation network for video object detection
- Research Article
- 10.1145/3803013
- Apr 20, 2026
- ACM Transactions on Multimedia Computing, Communications, and Applications
- Feng Zhu + 2 more
Recently, the challenging task of Open-Vocabulary Video Instance Segmentation (OVVIS) has been proposed. The OVVIS task requires simultaneously classifying, segmenting, and tracking objects in videos from an open set of categories, including novel categories unseen during training. Previous approaches typically rely on universal object proposals, memory-induced tracking, and open-vocabulary classification, which are often incompatible with established VIS and open-vocabulary segmentation methods. Observing that recent VIS methods share a common architecture decomposed into a segmenter and a tracker, we design a simple yet effective Switchable Open-vocabulary VIS (SOV) framework. SOV consists of an Open-Vocabulary Segmenter and a Dual Memory Tracker. The segmenter incorporates a frozen CLIP vision encoder as the backbone to enhance generalization on novel categories. The Dual Memory Tracker is training-free and utilizes a dual-memory mechanism to enhance tracking robustness. Moreover, we can easily switch to various trackers. Benefiting from this design, SOV can inherit advantages from state-of-the-art VIS methods. To further optimize training efficiency, we propose a progressive “Long-Image, Short-Video” training pipeline. This strategy decouples the training process into an extensive image-level pre-training phase followed by a rapid video-level adaptation phase, significantly accelerating convergence while effectively bridging the domain gap between static images and dynamic videos. Our method outperforms previous methods by large margins on various benchmarks while maintaining faster inference speeds. Specifically, SOV achieves 38.0 mAP on the LV-VIS validation set. It also achieves strong zero-shot performance on popular VIS datasets (YTVIS19 50.9 mAP, YTVIS21 45.2 mAP, OVIS 23.1 mAP), comparable to fully-supervised methods.
To further validate the flexibility of our switchable architecture, we extend SOV with the state-of-the-art CTVIS tracker, which yields improved performance (51.3 mAP) on YTVIS19. Code is available in the supplementary material.
- Research Article
- 10.1109/tpami.2026.3684742
- Apr 16, 2026
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Shengye Qiao + 4 more
Recent progress in semi-supervised video object segmentation has largely hinged on memory-based methods. However, when faced with increasingly tough challenges emerging in complex scenarios, such as fundamental semantic transformations and severe spatial deformations, the fixed-interval memory update mechanism usually adopted in these memory-based methods is insufficient to align with the pivotal moments of object changes. This inflexible mechanism motivates us to design an adaptive memory update mechanism in response to the semantic-spatial changes of target objects. To this end, we propose a novel Change-Sensitive Network (CSNet) to learn when and how to update memory to effectively address intricate challenges in complex scenarios. Specifically, we first design an Adaptive Perception-Capture module with a hierarchical contrastive learning loss to determine when to update memory by measuring the extent of object changes, thus dividing entire videos into different object-change clips. To further extract and highlight object changes to assist in the segmentation of frames after changes occur, we construct Dynamic Memory Update modules to redefine how to update memory by smoothly retaining the object prototypes within clips and dynamically amplifying the object variations across clips. Extensive experiments demonstrate that our proposed CSNet exhibits clear superiority when evaluated on eight datasets covering three kinds: common, complex and long-video datasets.
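The contrast this abstract draws between fixed-interval and change-driven memory updates can be illustrated with a small sketch. This is not CSNet's method: the abstract describes a learned Adaptive Perception-Capture module, whereas the code below substitutes a plain cosine-distance threshold on hypothetical per-frame feature vectors. The function names, the feature values, and the threshold are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def fixed_interval_updates(num_frames, interval):
    """Frames at which a fixed-interval scheme writes to memory."""
    return [t for t in range(num_frames) if t % interval == 0]


def change_triggered_updates(features, threshold):
    """Frames at which memory is written because the object feature has
    drifted far enough from the feature stored at the last update.
    (Hypothetical stand-in for a learned change detector.)"""
    if not features:
        return []
    updates = [0]            # the first frame always seeds the memory
    reference = features[0]
    for t in range(1, len(features)):
        if cosine_distance(features[t], reference) > threshold:
            updates.append(t)
            reference = features[t]
    return updates


# Toy per-frame object features: the appearance changes abruptly at frame 3.
toy_features = [[1.0, 0.0], [1.0, 0.05], [1.0, 0.1],
                [0.0, 1.0], [0.05, 1.0], [0.0, 1.0]]

fixed = fixed_interval_updates(len(toy_features), interval=4)
adaptive = change_triggered_updates(toy_features, threshold=0.5)
```

In this toy sequence the fixed schedule writes to memory at frames 0 and 4, while the change-triggered schedule writes at frames 0 and 3, the frame where the appearance actually changes.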
- Research Article
- 10.1016/j.knosys.2026.115572
- Apr 1, 2026
- Knowledge-Based Systems
- Guocai Du + 4 more
MTTrack: A joint mamba-transformer framework with memory enhancement for real-time satellite remote sensing video object tracking
- Research Article
- 10.1016/j.knosys.2026.115426
- Apr 1, 2026
- Knowledge-Based Systems
- Yuanlin Zhao + 5 more
Enabling nearshore cross-modal video object detector to learn more accurate spatial and temporal information
- Research Article
- 10.1109/tmi.2025.3627954
- Apr 1, 2026
- IEEE Transactions on Medical Imaging
- Yuwen Chen + 7 more
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
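For readers unfamiliar with the metric reported in this abstract, the Dice Similarity Coefficient of two binary masks A and B is 2|A∩B| / (|A| + |B|). The sketch below computes it for masks given as flat 0/1 lists. The function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks
    given as equal-length flat sequences of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Partial overlap: one shared pixel, two predicted, one true.
partial = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])

# Over-propagation: the target is absent from the current slice,
# but a mask carried over from the previous slice is still predicted.
over_propagated = dice_coefficient([1, 1, 0, 0], [0, 0, 0, 0])
```

The second case scores 0.0, which shows why the boundary slices described in the abstract, where the target disappears but a mask is still propagated, weigh heavily on the average Dice.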
- Research Article
- 10.3390/app16062934
- Mar 18, 2026
- Applied Sciences
- Shutong Chen + 2 more
Accurate detection of small objects in video analytics is limited by low pixel resolution and insufficient visual cues. While software-based enhancements often fail to recover missing details, Pan–Tilt–Zoom (PTZ) cameras can physically increase spatial resolution through optical zoom. However, mechanical latency and configuration complexity hinder their real-time applicability. We propose ZoomPatch, a real-time video analytics framework tailored for small object detection. ZoomPatch actively schedules PTZ adjustments to capture optically enhanced subframes of regions of interest (ROIs) and fuses inference results back to the global reference frame. Specifically, it introduces a dynamic Cycle Length Proposer to adapt analysis cycles based on scene motion, and a Mixed Integer Linear Programming (MILP)-based Configuration Decider to determine the optimal sequence of pan, tilt, and zoom adjustments under time budget constraints. Simulation-based experimental evaluations across diverse workloads demonstrate that ZoomPatch significantly outperforms fixed-perspective, super-resolution (SR), and greedy baselines. Notably, in the detection task using YOLOv10, ZoomPatch improves the F1-score from 0.33 to 0.47 (a 42% increase) compared to the fixed-perspective baseline. Furthermore, ZoomPatch yields performance gains of 30% and 7% over the SR baseline (0.36) and the greedy baseline (0.44).
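The relative gains quoted in this abstract can be recomputed from the four reported F1-scores. The sketch below assumes the gains are defined as (improved − baseline) / baseline; the dictionary keys are labels chosen here, not names from the paper.

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# F1-scores as reported in the abstract (rounded to two decimals there).
f1_scores = {"fixed_perspective": 0.33, "super_resolution": 0.36, "greedy": 0.44}
zoompatch_f1 = 0.47

gains = {
    name: round(relative_gain_percent(score, zoompatch_f1), 1)
    for name, score in f1_scores.items()
}
```

With the rounded scores this gives roughly 42.4%, 30.6%, and 6.8%, in line with the abstract's 42%, 30%, and 7%; the small difference on the SR baseline may simply reflect rounding of the published F1 values.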
- Research Article
- 10.3791/69299
- Mar 17, 2026
- Journal of Visualized Experiments: JoVE
- Miaomiao Feng + 1 more
This study aims to assess students' learning engagement in university classrooms using deep learning-based video object detection. To do so, via correlation analysis, this research first identified seven classroom behaviors presenting highly positive correlation with learning engagement as indicators to measure students' learning engagement; then it collected 30 synchronized videos of real classroom teaching from 6 classes from Shandong University of Science and Technology (SDUST) and divided them into a training set and a test set. After the seven behaviors were manually annotated in the training data, a machine learning algorithm was then trained in a supervised manner on this set. Once trained, the model generated initial annotations for the remaining unlabeled data. To achieve more accurate and efficient classroom behavior recognition, this study selected two representative algorithms, namely, Faster R-CNN and YOLOv5s, for behavior detection experiments. Based on a comparison of their detection performance in terms of accuracy and time cost, YOLOv5s was selected for classroom behavior detection in this study. Finally, this study used the focus group method to assign scores to each behavior and develop a three-level learning engagement scoring model. Based on automatically measured behavioral data, the model enables real-time, automatic assessment of learning engagement at both the individual and class levels.
- Research Article
- 10.1016/j.media.2025.103904
- Mar 1, 2026
- Medical Image Analysis
- Chenxiao Zhang + 2 more
Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank
- Research Article
- 10.1016/j.neunet.2026.108808
- Mar 1, 2026
- Neural Networks: The Official Journal of the International Neural Network Society
- Lin Xi + 2 more
High-quality, densely annotated data serve as a crucial foundation for developing robust X-ray angiography segmentation models. However, obtaining per-object pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. This paper aims to reduce the annotation costs of X-ray angiography videos by leveraging few-shot video object segmentation (FSVOS), which separates target objects from the background using only a single annotated frame during inference. We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col-like implementations (e.g., spatial convolutions, depthwise convolutions and feature-shifting mechanisms) or hardware-specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non-CUDA devices, we reorganize the local sampling process through a direction-based sampling perspective. Specifically, we implement a non-parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio-temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi-object segmentation in X-ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. 
Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state-of-the-art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications. Code is available at: https://github.com/xilin-x/XRAVOS.
- Research Article
- 10.1016/j.imavis.2026.105945
- Mar 1, 2026
- Image and Vision Computing
- Maojin Sun + 1 more
STSim-Mamb: A spatiotemporal similarity learning framework for unsupervised video object segmentation
- Research Article
- 10.1016/j.knosys.2026.115323
- Mar 1, 2026
- Knowledge-Based Systems
- Lisha Gao + 6 more
Adaptive region encoding for efficient video object detection in edge computing
- Research Article
- 10.1016/j.image.2025.117456
- Mar 1, 2026
- Signal Processing: Image Communication
- Zhiqiang Hou + 5 more
Video object segmentation based on feature compression and attention correction
- Research Article
- 10.3390/technologies14030142
- Feb 27, 2026
- Technologies
- Dongyang Zhou + 1 more
Real-time object detection in soccer videos presents significant challenges due to the dynamic nature of matches, varying object scales, and the stringent requirement for efficient processing. In this work, we define real-time detection as achieving inference speeds of at least 30 frames per second (FPS), which is the minimum requirement for smooth video processing and live broadcast applications. While transformer-based detectors have achieved remarkable accuracy, their quadratic computational complexity limits their real-time applications. In this paper, we propose SoccerDETR, a novel real-time detection framework that integrates MobileMamba-based visual state space models with an efficient transformer encoder for soccer object detection. Our approach introduces four key innovations: (1) a MobileMamba backbone leveraging selective state space modeling to achieve linear computational complexity while maintaining global receptive fields; (2) a Semantic-aware Dynamic Feature Fusion Module (SDFM) that adaptively aggregates multi-scale features through progressive semantic injection; (3) a Spatial-Channel Synergistic Attention (SCSA) mechanism that explores the synergistic effects between spatial and channel attention for enhanced feature representation; and (4) a Separable Dynamic Decoder that employs dynamic convolution attention to replace traditional cross-attention, significantly reducing computational overhead. Additionally, we design a Scale-Aware Focal Loss (SAFL) that addresses the class imbalance and scale variation problems inherent in soccer scenarios. Extensive experiments on the Soccana and SoccerNet datasets demonstrate that SoccerDETR achieves state-of-the-art performance with 94.2% mAP@50 on Soccana and 91.8% mAP@50 on SoccerNet, while maintaining a real-time inference speed of 78 FPS on a single NVIDIA RTX 4090 GPU with a batch size of 1 and an input resolution of 640 × 640.
Our method outperforms existing approaches by 2.3–5.7% in mAP while being 1.5–3.2× faster, demonstrating the effectiveness of state space models for efficient sports video object detection. Comprehensive ablation studies validate the effectiveness of each proposed component, and cross-dataset experiments demonstrate strong generalization capability.
- Research Article
- 10.1007/s11042-026-21444-x
- Feb 26, 2026
- Multimedia Tools and Applications
- Han Wu + 1 more
An improved semi-supervised video object segmentation and tracking algorithm for real-time applications
- Research Article
- 10.1145/3790093
- Feb 23, 2026
- ACM Computing Surveys
- Md Meftahul Ferdaus + 4 more
Few-shot learning (FSL) and data-efficient learning paradigms enable object detection models to recognize novel classes from minimally annotated examples, addressing expensive data-labeling challenges. This systematic survey examines recent advances in few-shot, semi-supervised, sparsely-supervised, and weakly-supervised approaches for video and 3D object detection, focusing on developments through foundation models and vision-language model integration. For video object detection, techniques including tube proposals, temporal matching networks, motion-guided approaches, and temporal consistency-based semi-supervised methods utilize spatiotemporal relationships for efficient novel class adaptation, with recent architectures achieving substantial gains from 33 to 48 average precision in few-shot scenarios. For 3D object detection, specialized approaches address point cloud sparsity and texture limitations through uncertainty-aware methods, geometric learning, and multimodal fusion, with sparsely-supervised techniques achieving competitive performance using only 2% of annotations, enabling practical deployment in autonomous driving and robotics. The survey analyzes methodological advances including meta-learning, transfer learning, pseudo-label generation, contrastive instance mining, and foundation model integration across applications spanning autonomous driving, surveillance, robotics, industrial control, and medical imaging. By examining developments across multiple supervision paradigms, this work highlights data-efficient learning’s potential for minimizing annotation requirements and enabling robust real-world deployment across temporal, spatial, and multimodal domains.
- Research Article
- 10.1016/j.neunet.2026.108705
- Feb 10, 2026
- Neural Networks: The Official Journal of the International Neural Network Society
- Bingxun Zhao + 3 more
TransUTD: Underwater cross-domain collaborative spatial-temporal transformer detector
- Research Article
- 10.1007/s10278-026-01855-w
- Feb 6, 2026
- Journal of Imaging Informatics in Medicine
- Yan Wang + 8 more
Early and accurate diagnosis of nasopharyngeal-laryngeal tumors is critical for improving patient prognosis. Deep learning methods have achieved significant progress in the automatic detection of lesions in static endoscopic images. However, during nasopharyngeal-laryngeal endoscopy, the quality of endoscopic videos often suffers from motion blur, uneven exposure, and reflective artifacts, which adversely affect the performance of existing static image detectors. Therefore, we propose a novel two-stage video lesion detection network, DynSTPN, to address the challenge of lesion detection in complex scenarios. First, in the prompt generation network stage, we design a dynamic prompt generator that generates discriminative prompts based on spatio-temporal feature representations of reference frames to mitigate quality degradation in inference frames. Second, at the object detection network stage, we introduce an adaptive differentiable gating mechanism to integrate the reference frames' prompt information, dynamically adjusting the enhancement effect of reference frames on the inference frame. Experiments were conducted on two datasets: the self-constructed four-category nasopharyngeal-laryngeal lesion video object detection (NLLVOD) dataset and the publicly available ImageNet VID dataset. Compared to state-of-the-art (SOTA) methods, DynSTPN achieved the best balance between detection accuracy and efficiency on the VID dataset. On the NLLVOD dataset, DynSTPN achieved a superior detection accuracy of 79.6% and a speed of 29.4 FPS, meeting the real-time requirements for clinical applications. These results significantly outperform the SOTA static image detector YOLOv12-M. Experimental results demonstrate that DynSTPN effectively leverages information from video reference frames to enhance detection performance, achieving superior accuracy compared to SOTA image/video methods, thereby offering enhanced clinical applicability.
- Research Article
- 10.1016/j.ecoinf.2026.103674
- Feb 1, 2026
- Ecological Informatics
- Moses Lurbur + 2 more
Towards automated bycatch monitoring: Optimizing and evaluating multi-object tracking of salmon in pollock trawls
- Research Article
- 10.1007/s11263-025-02700-3
- Jan 30, 2026
- International Journal of Computer Vision
- Yuheng Shi + 2 more
Practical Video Object Detection via Feature Selection and Aggregation