Abstract

Appearance-based gaze estimation with deep learning has achieved considerable results. However, CNNs do not prioritize the information within the key image blocks of a face image, so the extracted local blocks lack information and the precision of gaze estimation suffers. In this work, fine-grained visual information is extracted from local blocks using a Transformer in Transformer (TNT) model that emphasizes the interactions between local blocks. First, two TNT-based gaze estimation models (GTiT-Pure and GTiT-Hybrid) are built on the TNT model to capture both coarse- and fine-grained visual information about the face, which is then combined into a feature representation for gaze regression. The performance of the two models is then evaluated on four gaze estimation datasets. Experimental results demonstrate that the pre-trained GTiT-Pure achieves lower gaze estimation error than most CNN and Transformer models, and that GTiT-Hybrid performs best on the EyeDiap and RT-Gene datasets. The GTiT model, which combines coarse- and fine-grained gaze features in face images, can be used to further explore the advantages of Transformer models in gaze feature extraction.
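To illustrate the general idea of regressing gaze from TNT features, the following is a minimal sketch, not the authors' GTiT code: it uses the publicly available TNT implementation in the timm library (tnt_s_patch16_224) as the backbone, and the GazeTNT wrapper, the two-dimensional (pitch, yaw) output head, and the input conventions are all illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of a TNT-backed gaze regressor (illustrative, not the paper's GTiT).
import torch
import torch.nn as nn
import timm


class GazeTNT(nn.Module):
    """TNT backbone plus a small linear head regressing gaze as (pitch, yaw)."""

    def __init__(self, pretrained: bool = True):
        super().__init__()
        # num_classes=0 removes timm's classification head, so the backbone
        # returns pooled features; TNT's inner transformer models pixel-level
        # interactions within each local block, the outer one models blocks.
        self.backbone = timm.create_model(
            "tnt_s_patch16_224", pretrained=pretrained, num_classes=0
        )
        # Hypothetical regression head: 2 outputs for pitch and yaw angles.
        self.head = nn.Linear(self.backbone.num_features, 2)

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # face: (B, 3, 224, 224) normalized face crops.
        feats = self.backbone(face)  # (B, num_features)
        return self.head(feats)      # (B, 2) gaze angles


# Usage example (pretrained=False keeps this runnable without downloading weights).
model = GazeTNT(pretrained=False)
gaze = model(torch.randn(1, 3, 224, 224))
print(gaze.shape)  # torch.Size([1, 2])
```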
