Abstract

Over the past few years, gaze prediction has developed rapidly within research on human visual attention mechanisms. However, a large gap remains between prediction models and the human visual system (HVS). This work makes four contributions to egocentric video gaze prediction. First, we introduce a new benchmark named ECVG (egocentric video gaze), which consists of 3K high-quality, carefully selected video sequences spanning a wide range of scenes. Existing eye-tracking datasets constrain participants' activities; in contrast, ECVG collects eye-movement data during real, unconstrained human activity. Second, we establish a visual attention-selection and gaze-tracking model, which summarizes the human visual attention process. Third, we propose a novel feature pyramid interactive attention 3D network (FPIANet) for egocentric gaze prediction, which directly captures long-term relationships between spatiotemporal features at different time steps and achieves nonlocal feature interactions across the temporal, spatial, and scale dimensions. In addition, we propose a multiscale interactive attention (MSIA) module, which explicitly integrates the human top-down and bottom-up visual attention mechanisms to transform any spatiotemporal feature into another feature of the same size but with richer context. Furthermore, we propose a new gaze-transfer-path consistency evaluation and thoroughly examine the performance of our model on two large datasets (ECVG and LEDOV). The experimental results show that our model outperforms existing state-of-the-art methods.
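To make the idea of a same-size feature transform with multiscale, nonlocal context more concrete, the following is a minimal PyTorch sketch of a generic multiscale nonlocal attention block. It is not the paper's MSIA implementation; all names (MSIABlock, num_scales, reduction) and design choices here are illustrative assumptions, showing only how a spatiotemporal feature can attend to coarser copies of itself and return an output of the original size.

```python
# Hypothetical sketch of a multiscale interactive attention block (PyTorch).
# Not the authors' MSIA module; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSIABlock(nn.Module):
    """Map a spatiotemporal feature (B, C, T, H, W) to one of the same size
    whose responses aggregate context from coarser spatial scales via
    non-local (dot-product) attention."""

    def __init__(self, channels, num_scales=3, reduction=2):
        super().__init__()
        self.num_scales = num_scales
        inner = channels // reduction
        self.query = nn.Conv3d(channels, inner, kernel_size=1)
        self.key = nn.Conv3d(channels, inner, kernel_size=1)
        self.value = nn.Conv3d(channels, inner, kernel_size=1)
        self.out = nn.Conv3d(inner, channels, kernel_size=1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.query(x).flatten(2)                      # (B, C', T*H*W)
        keys, values = [], []
        for s in range(self.num_scales):
            # Coarser spatial copies of the feature supply multiscale context.
            xs = F.avg_pool3d(x, kernel_size=(1, 2 ** s, 2 ** s)) if s > 0 else x
            keys.append(self.key(xs).flatten(2))          # (B, C', N_s)
            values.append(self.value(xs).flatten(2))
        k = torch.cat(keys, dim=2)                        # (B, C', sum N_s)
        v = torch.cat(values, dim=2)
        attn = torch.softmax(q.transpose(1, 2) @ k / k.shape[1] ** 0.5, dim=-1)
        ctx = (attn @ v.transpose(1, 2)).transpose(1, 2)  # (B, C', T*H*W)
        ctx = ctx.view(b, -1, t, h, w)
        return x + self.out(ctx)                          # same size, richer context


if __name__ == "__main__":
    feat = torch.randn(1, 64, 4, 32, 32)
    print(MSIABlock(64)(feat).shape)                      # torch.Size([1, 64, 4, 32, 32])
```

The residual connection and 1x1x1 output projection keep the output shape identical to the input, so such a block could in principle be inserted at any level of a feature pyramid; how the paper actually couples top-down and bottom-up signals is not reproduced here.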
