Abstract
In this paper, we propose a novel approach for estimating visual focus of attention in video streams. The method is based on an unsupervised algorithm that incrementally learns appearance clusters from low-level visual features extracted from face patches provided by a face tracker. The clusters learned in this way can then be used to classify the visual attention targets of a given person during a tracking run, without any prior knowledge of the environment, the room configuration, or the visible persons. Experiments on public datasets containing almost two hours of annotated meeting and video-conferencing footage show that the proposed algorithm produces state-of-the-art results and even outperforms a traditional supervised method based on head-orientation estimation, which classifies visual focus of attention using Gaussian Mixture Models.
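To illustrate the kind of incremental, unsupervised clustering the abstract describes, here is a minimal sketch in Python. It is not the authors' algorithm: the leader-style clustering rule, the distance threshold `tau`, and the running-mean centroid update are all assumptions chosen for simplicity; the paper's method would operate on appearance features from tracked face patches.

```python
import math

def incremental_cluster(features, tau=1.0):
    """Leader-style incremental clustering (illustrative sketch only).

    Each feature vector is assigned to the nearest existing centroid if
    it lies within distance tau; otherwise it starts a new cluster.
    Cluster indices can then serve as proxy labels for attention targets.
    Returns (labels, centroids).
    """
    centroids = []  # running mean of each cluster
    counts = []     # number of members per cluster
    labels = []
    for x in features:
        best, best_d = None, float("inf")
        for k, c in enumerate(centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = k, d
        if best is None or best_d > tau:
            # no sufficiently close cluster: open a new one
            centroids.append(list(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # incremental mean update of the matched centroid
            counts[best] += 1
            centroids[best] = [c + (xi - c) / counts[best]
                               for c, xi in zip(centroids[best], x)]
            labels.append(best)
    return labels, centroids
```

In a full system, each discovered cluster would correspond to a recurring head appearance, and hence (approximately) to one visual attention target, without any supervised training on room geometry.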