Abstract

Gaze following aims to predict where a person is looking in a scene. Existing methods tend to prioritize traditional 2D RGB visual cues, or require burdensome prior knowledge and additional, expensive datasets annotated in 3D coordinate systems to train specialized modules that enhance scene modeling. In this work, we introduce a novel framework built on a simple ResNet backbone, which uses only images and depth maps to mimic human visual preferences and achieve 3D-like depth perception. We first leverage depth maps to derive spatial proximity information between the objects in the scene and the target person. This sharpens the focus of the gaze cone on the region of interest pertaining to the target while diminishing the impact of surrounding distractions. To capture the diverse dependence of scene context on the saliency gaze cone, we then introduce a learnable grid-level regularized attention that anticipates coarse-grained regions of interest, thereby refining the mapping of the saliency feature to pixel-level heatmaps. This allows our model to better account for individual differences when predicting others' gaze locations. Finally, we employ a KL-divergence loss to supervise the grid-level regularized attention, combining it with the gaze direction, heatmap regression, and in/out classification losses to provide comprehensive supervision for model optimization. Experimental results on two publicly available datasets demonstrate that our model achieves comparable performance while relying on less modal information. Quantitative visualization results further validate the interpretability of our method. The source code will be available at https://github.com/VUT-HFUT/DepthMatters .
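To make the composite training objective described above concrete, the following PyTorch sketch shows one way the KL-divergence term on the grid-level attention could be combined with gaze-direction, heatmap-regression, and in/out-classification losses. This is not the authors' implementation: the function name, the specific loss choices for the individual terms (cosine distance, MSE, binary cross-entropy), and the weighting coefficients are all illustrative assumptions.

# Minimal sketch (assumed, not the authors' code) of the combined objective.
import torch
import torch.nn.functional as F

def combined_loss(pred_grid_attn, gt_grid_attn,
                  pred_dir, gt_dir,
                  pred_heatmap, gt_heatmap,
                  pred_inout, gt_inout,
                  lambda_attn=1.0, lambda_dir=1.0,
                  lambda_hm=1.0, lambda_io=1.0):
    # KL-divergence supervision on the grid-level regularized attention,
    # treating predicted and target attentions as spatial distributions.
    attn_loss = F.kl_div(pred_grid_attn.clamp_min(1e-8).log(),
                         gt_grid_attn, reduction="batchmean")

    # Gaze-direction term: cosine distance between predicted and
    # ground-truth direction vectors (a common choice; assumed here).
    dir_loss = 1.0 - F.cosine_similarity(pred_dir, gt_dir, dim=-1).mean()

    # Heatmap regression: pixel-wise MSE against the ground-truth heatmap.
    hm_loss = F.mse_loss(pred_heatmap, gt_heatmap)

    # In/out-of-frame classification: binary cross-entropy on logits.
    io_loss = F.binary_cross_entropy_with_logits(pred_inout, gt_inout)

    # Weighted sum; the lambda_* coefficients are illustrative placeholders.
    return (lambda_attn * attn_loss + lambda_dir * dir_loss
            + lambda_hm * hm_loss + lambda_io * io_loss)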
