Abstract

Gaze-following aims to predict the gaze target of a subject within an image, and depth and orientation information greatly improves this task. However, previous methods require additional datasets to obtain depth or orientation information, leading to cumbersome training or inference processes. To this end, we propose an end-to-end depth-aware gaze-following approach that incorporates depth and orientation information without additional datasets. Our approach consists of a primary task, gaze-following, supervised by ground-truth labels from the gaze-following dataset, and two auxiliary tasks, scene depth estimation and 3D orientation estimation, supervised by generated pseudo-labels. Intermediate auxiliary features are integrated into the primary task network as implicit information. We further propose a residual filter module that screens this information for cues useful to gaze-following prediction. Extensive experiments on GazeFollow and VideoAttentionTarget show that our approach achieves state-of-the-art results (0.120 Avg. Dist. on GazeFollow and 0.104 L2 Dist. on VideoAttentionTarget). Finally, we apply our approach to a real robot for understanding human attention and intention. Compared with the previous depth-aware gaze-following method, ours halves the computation time.
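
The abstract describes fusing auxiliary (depth/orientation) features into the primary gaze-following branch through a residual filter module. Below is a minimal sketch of how such a gated residual fusion could look; the module name, channel sizes, and the sigmoid-gate formulation are illustrative assumptions, not the authors' exact design.

```python
# Hypothetical sketch of a residual filter fusing auxiliary features
# (e.g., depth-branch features) into the primary gaze-following branch.
import torch
import torch.nn as nn


class ResidualFilter(nn.Module):
    """Screens auxiliary features with a learned gate and adds the
    filtered result back to the primary features as a residual."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions produce a per-location, per-channel gate in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, primary: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        # Decide, from both feature maps, how much of the auxiliary signal to pass.
        g = self.gate(torch.cat([primary, auxiliary], dim=1))
        # Residual fusion: primary features plus the gated auxiliary contribution.
        return primary + self.proj(g * auxiliary)


if __name__ == "__main__":
    gaze_feat = torch.randn(2, 256, 28, 28)   # primary gaze-following features
    depth_feat = torch.randn(2, 256, 28, 28)  # auxiliary depth features
    fused = ResidualFilter(256)(gaze_feat, depth_feat)
    print(fused.shape)  # torch.Size([2, 256, 28, 28])
```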
