Optical flow networks have been widely used for video saliency detection (VSD) because they effectively capture object motion. However, optical flow blurs the edges of salient objects, leading to poorly defined object boundaries. To address this issue, we propose an optical flow-based edge-weighted loss function for training a network, called Flow-Edge-Net, that balances the weights of foreground and background information at the edges of video frames and thereby detects salient boundaries more accurately. Specifically, we propose two complementary encoder-decoder networks based on the concept of decoupling: the optical flow network focuses on moving objects, while the edge network focuses on edge information. Because the two networks take the same input and output features of the same dimension, our adaptive weighted feature fusion module can compare and integrate the edge information and location information from the two networks through adaptive weighting. The proposed method has been evaluated on five widely used databases. Experimental results show that Flow-Edge-Net locates salient objects with accurate, refined edges and outperforms state-of-the-art methods for detecting salient objects in videos.
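The abstract does not specify how the adaptive weighted feature fusion module is implemented; as one plausible reading, the module could learn scalar gates for the two streams and normalize them before mixing. The sketch below illustrates that idea only; the function names, the scalar gating, and the softmax normalization are all assumptions, not the paper's published code.

```python
# Hypothetical sketch of adaptive weighted feature fusion (not the authors'
# implementation): two same-dimension feature vectors are mixed using
# softmax-normalized scalar gates standing in for learned weights.
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(flow_feat, edge_feat, gate_flow, gate_edge):
    """Fuse the optical-flow and edge feature vectors element-wise.

    gate_flow / gate_edge are assumed scalar gates (in a real network they
    would be produced by a small sub-network from the pooled features).
    """
    w_flow, w_edge = softmax([gate_flow, gate_edge])
    return [w_flow * f + w_edge * e for f, e in zip(flow_feat, edge_feat)]

# Equal gates reduce the fusion to a plain average of the two streams.
fused = fuse([1.0, 2.0], [3.0, 4.0], 0.0, 0.0)  # -> [2.0, 3.0]
```

A gate biased toward the edge stream would shift the fused features toward the edge network's output, which matches the stated goal of letting the two decoupled branches contribute adaptively rather than equally.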