In recent years, recognizing the visual focus of attention (VFoA) has attracted considerable interest among computer vision researchers due to its many applications in Human–Computer Interaction (HCI) and Human–Robot Interaction (HRI). Although eye gaze is a strong cue for determining someone's focus of attention (FoA), gaze alone is difficult to rely on when the interacting partners are far away or the camera cannot capture high-resolution images at long distances. Head pose can therefore serve as an approximation of a person's focus of attention. This paper proposes a vision-based framework that detects human FoA from nine head poses and consists of four main modules: face detection and facial key-point selection (FDKPSM), head pose classification (HPCM), object localization and classification (OLCM), and focus of attention estimation (FoAEM). The FDKPSM uses the Multi-task Cascaded Convolutional Network (MTCNN) framework to detect faces and select facial key-points, and the HPCM classifies head poses into nine classes using ResNet18. To estimate the FoA, the FoAEM applies a mapping algorithm (EFoA) that associates head poses with the focused object. Experimental results show that the proposed model outperformed other deep learning models, achieving the highest accuracy on three datasets: BIWI-M (96.97%), Pointing'04-M (96.04%), and HPoD 9 (98.99%). The visual focus of attention model achieved an accuracy of 94.12% in the multi-object scenario.
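A minimal sketch of the described pipeline is given below, assuming PyTorch, the facenet-pytorch MTCNN implementation for the FDKPSM stage, and a ResNet18 with a 9-way output head for the HPCM stage. The pose class names and the pose-to-object lookup are illustrative stand-ins: the paper's actual HPCM label set and EFoA mapping algorithm (which integrates OLCM object locations) are not specified in the abstract.

```python
# Sketch of the four-module FoA pipeline under the assumptions stated above.
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.models import resnet18
from facenet_pytorch import MTCNN
from PIL import Image

# Assumed names for the nine head-pose classes; the paper's labels may differ.
POSES = ["front", "left", "right", "up", "down",
         "up-left", "up-right", "down-left", "down-right"]

# FDKPSM: face detection and facial key-point selection via MTCNN.
detector = MTCNN(keep_all=True)

# HPCM: ResNet18 with its final layer replaced for nine head-pose classes.
# Weights would come from training on BIWI-M / Pointing'04-M / HPoD 9.
classifier = resnet18(weights=None)
classifier.fc = nn.Linear(classifier.fc.in_features, len(POSES))
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# FoAEM stand-in: a toy lookup from head pose to a scene object. The paper's
# EFoA algorithm instead maps poses onto objects localized by the OLCM.
POSE_TO_OBJECT = {"left": "monitor", "right": "door", "front": "camera"}

def estimate_foa(image: Image.Image) -> list[str]:
    """Return one focused-object guess per detected face."""
    boxes, _ = detector.detect(image)
    results = []
    if boxes is None:
        return results
    for x1, y1, x2, y2 in boxes:
        face = image.crop((int(x1), int(y1), int(x2), int(y2)))
        with torch.no_grad():
            logits = classifier(preprocess(face).unsqueeze(0))
        pose = POSES[int(logits.argmax(dim=1))]
        results.append(POSE_TO_OBJECT.get(pose, "unknown"))
    return results

if __name__ == "__main__":
    print(estimate_foa(Image.open("scene.jpg").convert("RGB")))
```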