Abstract

Appearance-based gaze estimation has attracted increasing attention because of its generality, robustness, and subject independence. Deep learning, which has achieved great success in computer vision, has also substantially improved the accuracy of appearance-based gaze estimation. To further reduce gaze estimation error, we focus on extracting better features from eye and face images. In this paper, we propose a novel multimodal fusion gaze estimation model based on ConvNext and dilated convolution. The model takes an eye image and a face image as input: a ConvNext network extracts features from the face image, a dilated convolution-based network extracts features from the eye image, and the feature maps of the two images are fused by a fully connected layer to perform gaze estimation. In the experiments, we evaluate the model on the public MPIIGaze dataset and compare it with other gaze estimation models. The results show that the proposed method is more accurate on MPIIGaze than the related works it is compared with, and that the proposed multimodal fusion model achieves state-of-the-art results on this dataset.
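
The two building blocks named in the abstract, dilated convolution and fusion through a fully connected layer, can be illustrated with a minimal dependency-free sketch. This is not the authors' implementation: the function names `dilated_conv1d` and `fuse_and_regress`, the 1D signal standing in for image features, the toy weights, and the two-value output (assumed here to be a pair of gaze angles) are all hypothetical choices made only for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D dilated convolution (cross-correlation form).

    A dilation of d samples the input every d positions, so a kernel of
    size k covers (k - 1) * d + 1 inputs without extra weights.
    """
    span = (len(kernel) - 1) * dilation + 1  # inputs covered by one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + k * dilation]
                       for k, w in enumerate(kernel)))
    return out


def fuse_and_regress(face_feat, eye_feat, weights, bias):
    """Concatenate two feature vectors and apply one fully connected layer."""
    fused = list(face_feat) + list(eye_feat)
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]


# Hypothetical toy inputs: an "eye" signal and a precomputed "face" feature.
eye_signal = [1, 2, 3, 4, 5]
eye_feat = dilated_conv1d(eye_signal, kernel=[1, 1, 1], dilation=2)
face_feat = [1.0, 2.0]

# One fully connected layer mapping 3 fused features to 2 outputs
# (assumed here to stand for two gaze angles).
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
bias = [0.5, 0.0]
gaze = fuse_and_regress(face_feat, eye_feat, weights, bias)
```

With a dilation of 2, the three-tap kernel spans five input samples, which shows how dilation enlarges the receptive field without adding parameters. In a real model the face branch would be a ConvNext backbone and both branches would operate on 2D images.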
