Abstract

Gaze prediction is a key problem in visual perception research. It can be used to infer important regions in videos and thereby reduce the amount of computation required for learning and inference in various analysis tasks. Vanilla methods for dynamic video fail to extract valid features and ignore the motion information between frames, which leads to poor prediction results. We propose a gaze prediction method based on LSTM convolution with associated features of video frames (LSTM-CVFAF). First, by adding learnable central prior knowledge, the proposed method effectively and accurately extracts the spatial information of each frame. Second, an LSTM is deployed to capture temporal motion gaze features. Finally, the spatial and temporal motion information is fused to generate gaze prediction maps for the dynamic video. Compared with state-of-the-art models on the DHF1K dataset, CC, AUC-J, sAUC, and NSS are improved by 5.1%, 0.6%, 38.2%, and 0.5%, respectively.
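
The abstract outlines a three-stage pipeline (per-frame spatial features with a learnable center prior, a convolutional LSTM for temporal motion, and fusion into gaze maps). The sketch below is a minimal, hedged illustration of such a pipeline, not the authors' implementation: the module names, layer sizes, and the ConvLSTM formulation are assumptions made for clarity.

```python
# Minimal sketch (NOT the paper's code): a ConvLSTM-style gaze predictor with a
# learnable center-prior map, written against a standard PyTorch API.
# All architecture choices (channel counts, kernel sizes, encoder depth) are
# illustrative assumptions.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell over feature maps (assumed formulation)."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class GazePredictor(nn.Module):
    def __init__(self, feat_ch=64, hid_ch=64, prior_size=(28, 28)):
        super().__init__()
        # Frame-wise spatial encoder (stand-in for the paper's feature extractor).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learnable center-prior map added to the spatial features.
        self.center_prior = nn.Parameter(torch.zeros(1, 1, *prior_size))
        self.lstm = ConvLSTMCell(feat_ch, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)  # fuse into a per-frame gaze map

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        h = c = None
        maps = []
        for t in range(T):
            feat = self.encoder(clip[:, t])       # spatial features per frame
            if h is None:
                h = feat.new_zeros(B, self.lstm.hid_ch, *feat.shape[-2:])
                c = torch.zeros_like(h)
            prior = nn.functional.interpolate(
                self.center_prior, size=feat.shape[-2:],
                mode="bilinear", align_corners=False)
            h, c = self.lstm(feat + prior, (h, c))  # temporal motion features
            maps.append(torch.sigmoid(self.readout(h)))
        return torch.stack(maps, dim=1)           # (B, T, 1, h', w') gaze maps


# Usage example on a dummy 8-frame clip.
if __name__ == "__main__":
    model = GazePredictor()
    clip = torch.randn(2, 8, 3, 112, 112)
    print(model(clip).shape)  # torch.Size([2, 8, 1, 28, 28])
```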
