Abstract

Eye contact is among the most fundamental means of social communication used by humans. Quantifying eye contact is valuable for the analysis of social roles and communication skills, and for clinical screening. Estimating a subject’s looking direction is a challenging task, but eye contact can be effectively captured by a wearable point-of-view camera, which provides a unique viewpoint. While moments of eye contact from this viewpoint can be hand-coded, such a process tends to be laborious and subjective. In this work, we develop a deep neural network model to automatically detect eye contact in egocentric video. It is the first to achieve accuracy equivalent to that of human experts. We train a deep convolutional network using a dataset of 4,339,879 annotated images from 103 subjects with diverse demographic backgrounds, 57 of whom have a diagnosis of Autism Spectrum Disorder. The network achieves an overall precision of 0.936 and recall of 0.943 on 18 validation subjects, on par with 10 trained human coders, who achieve a mean precision of 0.918 and recall of 0.946. Our method will be instrumental in gaze behavior analysis by serving as a scalable, objective, and accessible tool for clinicians and researchers.
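The reported precision and recall are standard frame-level detection metrics. As a minimal sketch (not the authors' evaluation code, and with hypothetical labels), they can be computed from binary per-frame annotations like this:

```python
def precision_recall(y_true, y_pred):
    """Frame-level precision and recall for binary eye-contact labels.

    y_true, y_pred: sequences of 0/1, where 1 means eye contact in that frame.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example with made-up labels: one true positive, one false
# positive, one false negative -> precision 0.5, recall 0.5.
p, r = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
```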

Highlights

  • Eye contact is among the most fundamental means of social communication used by humans

  • In t-tests and χ² tests run to confirm that subjects in the validation set are representative of the overall sample, the validation set did not differ from the rest of the sample in diagnostic group (χ² = 0.09, p = 0.77), gender (χ² = 3.62, p = 0.06), age (t = 0.49, p = 0.62; M = 37.72 vs. 36.17, SD = 13.10 vs. 12.15), race (χ² = 2.70, p = 0.61), ethnicity (χ² = 0.29, p = 0.86), or severity of social impairment within the autism spectrum disorder (ASD) group (t = 1.18, p = 0.24; M = 7.50 vs. 7.81, SD = 1.51 vs. 2.01)

  • Our model enables the scalable measurement of eye contact during face-to-face interactions
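The group comparisons in the highlights use standard two-sample t and χ² tests. Below is a minimal, dependency-free sketch of the underlying statistics (a pooled-variance t statistic for a continuous variable such as age, and a Pearson χ² statistic for a 2×2 contingency table such as diagnostic group × split); the authors' exact test variants and any continuity corrections are assumptions here:

```python
import math

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (e.g. age: validation vs. rest)."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def chi2_statistic(table):
    """Pearson χ² for a 2x2 table, e.g. [[ASD_val, ASD_rest], [TD_val, TD_rest]]."""
    row = [sum(r) for r in table]          # row totals
    col = [sum(c) for c in zip(*table)]    # column totals
    n = sum(row)
    # Sum of (observed - expected)^2 / expected over all four cells.
    return sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(2) for j in range(2))
```

A balanced table yields χ² = 0, and identical samples yield t = 0; a p-value would then be obtained from the corresponding reference distribution.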


Introduction

Eye contact is among the most fundamental means of social communication used by humans. Quantifying eye contact is valuable for the analysis of social roles and communication skills, and for clinical screening. While human raters can achieve levels of agreement above 90% when identifying instances of eye contact in point-of-view (PoV) videos [22,23], the accuracy of the automated detection approaches in prior works [21,24] is well below this level, making automatic coding unusable by researchers and practitioners as a measurement tool. This paper addresses this challenge by exploring three directions. Establishing that our automated approach matches the accuracy of expert human coders validates the feasibility of fully automated eye contact coding.

