Abstract

Current video saliency prediction methods have made great progress by exploiting the feature extraction capabilities of CNNs, but their hierarchical feature fusion remains deficient, limiting further gains in accuracy. To address this issue, we propose a 3D convolutional Hierarchical Spatiotemporal Feature Fusion Network (HSFF-Net). Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of a bi-directional fusion architecture in this field, which adds a flow of shallow location information to the flow of deep semantic information. Then, in contrast to simple addition and concatenation, we design a Hierarchical Adaptive Fusion (HAF) mechanism that adaptively learns the fusion weights of adjacent features. Moreover, a Frame-wise Attention (FA) module is introduced to augment the temporal features to be fused. Our model is simple yet effective and runs in real time. Experimental results on three video saliency benchmarks demonstrate that HSFF-Net outperforms existing state-of-the-art methods in accuracy.
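
The abstract names the HAF mechanism but gives no implementation details. As a rough illustration of what "adaptively learning the fusion weights of adjacent features" could look like, here is a minimal PyTorch sketch of a normalized learnable-weight fusion (in the spirit of BiFPN-style weighted fusion); the class name, tensor shapes, and weighting scheme are assumptions, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalAdaptiveFusion(nn.Module):
        """Hypothetical sketch: instead of plain addition or concatenation,
        learn one non-negative scalar weight per input and combine two
        adjacent pyramid features as a normalized weighted sum."""
        def __init__(self, channels, eps=1e-4):
            super().__init__()
            # one learnable weight per input feature, kept non-negative via relu
            self.w = nn.Parameter(torch.ones(2))
            self.eps = eps
            # a 3D conv to refine the fused spatiotemporal feature
            self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, shallow, deep):
            # shallow, deep: (B, C, T, H, W) features from adjacent pyramid
            # levels, assumed already resized to a common resolution
            w = F.relu(self.w)
            w = w / (w.sum() + self.eps)   # normalize so the weights sum to ~1
            fused = w[0] * shallow + w[1] * deep
            return self.refine(fused)

    # Example usage (shapes are illustrative):
    # haf = HierarchicalAdaptiveFusion(channels=64)
    # out = haf(torch.randn(1, 64, 8, 56, 56), torch.randn(1, 64, 8, 56, 56))

Compared with plain addition, the two learned scalars let the network shift emphasis between the shallow (location) and deep (semantic) streams during training.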
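Likewise, the Frame-wise Attention module is only named here. One common way to realize frame-wise reweighting is a squeeze-and-excitation-style gate over the temporal axis; the sketch below is a hypothetical reading of "augmenting the temporal features to be fused", not the paper's FA module.

    class FrameWiseAttention(nn.Module):
        """Hypothetical sketch: score each frame of a clip with a small MLP
        on its pooled descriptor, then gate the frames by those scores."""
        def __init__(self, channels):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // 4),
                nn.ReLU(inplace=True),
                nn.Linear(channels // 4, 1),
            )

        def forward(self, x):
            # x: (B, C, T, H, W)
            # per-frame descriptor via spatial global average pooling
            desc = x.mean(dim=(3, 4)).permute(0, 2, 1)       # (B, T, C)
            scores = torch.sigmoid(self.fc(desc))            # (B, T, 1)
            # reshape to (B, 1, T, 1, 1) so it broadcasts over C, H, W
            scores = scores.permute(0, 2, 1).unsqueeze(-1).unsqueeze(-1)
            return x * scores                                # reweight frames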
