Video instance segmentation, a key technology for intelligent sensing in visual perception, is central to automated surveillance, robotics, and smart cities. These applications depend on real-time, efficient target tracking for accurate perception and intelligent analysis of dynamic environments. However, traditional video instance segmentation methods suffer from complex models, high computational overhead, and slow segmentation in temporal feature extraction, especially in resource-constrained environments. To address these challenges, this paper proposes a Dual-Channel and Frequency-Aware approach for Lightweight Video Instance Segmentation (DCFA-LVIS). For feature extraction, a DCEResNet backbone built on a dual-channel feature enhancement mechanism strengthens feature extraction and representation, improving model accuracy. For instance tracking, a dual-frequency perceptual enhancement network uses an independent instance query mechanism to capture temporal information, combined with a frequency-aware attention mechanism that captures instance features in separate high- and low-frequency attention layers; this reduces model complexity, cuts the number of parameters, and improves segmentation efficiency. Experiments on the YouTube-VIS dataset show that the proposed model achieves state-of-the-art segmentation performance with few parameters, demonstrating its efficiency and practicality. The method significantly improves the efficiency and adaptability of visual-perception intelligent sensing in video data acquisition and processing, providing strong support for its widespread deployment.
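To make the frequency-aware idea concrete, the following is a minimal PyTorch sketch of one possible reading of the abstract: instance-query features are split into a smoothed (low-frequency) component and a residual (high-frequency) component, each is processed by its own attention layer, and the two outputs are fused. All names and design choices here (FrequencyAwareAttention, the average-pooling low-pass filter, the fusion layer) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FrequencyAwareAttention(nn.Module):
    """Hypothetical dual-frequency attention block over instance queries."""

    def __init__(self, dim: int, num_heads: int = 8, low_pass_kernel: int = 3):
        super().__init__()
        # A low-pass filter along the query axis approximates the low-frequency
        # component; the residual is treated as the high-frequency component.
        self.low_pass = nn.AvgPool1d(low_pass_kernel, stride=1,
                                     padding=low_pass_kernel // 2)
        self.low_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.high_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (batch, num_instance_queries, dim)
        low = self.low_pass(queries.transpose(1, 2)).transpose(1, 2)
        high = queries - low
        low_out, _ = self.low_attn(low, low, low)
        high_out, _ = self.high_attn(high, high, high)
        return self.fuse(torch.cat([low_out, high_out], dim=-1))


if __name__ == "__main__":
    block = FrequencyAwareAttention(dim=256)
    q = torch.randn(2, 100, 256)   # 2 clips, 100 instance queries each
    print(block(q).shape)          # torch.Size([2, 100, 256])
```

Splitting the queries before attention lets each branch use a smaller, specialized attention layer, which is consistent with the abstract's claim of reduced complexity and parameter count, though the actual decomposition used in DCFA-LVIS may differ.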