Abstract

Manual classification is still a common method to evaluate event detection algorithms. The procedure is often as follows: Two or three human coders and the algorithm classify a significant quantity of data. In the gold standard approach, deviations from the human classifications are considered to be due to mistakes of the algorithm. However, little is known about human classification in eye tracking. To what extent do the classifications from a larger group of human coders agree? Twelve experienced but untrained human coders classified fixations in 6 min of adult and infant eye-tracking data. When using the sample-based Cohen’s kappa, the classifications of the humans agreed near perfectly. However, we found substantial differences between the classifications when we examined fixation duration and number of fixations. We hypothesized that the human coders applied different (implicit) thresholds and selection rules. Indeed, when spatially close fixations were merged, most of the classification differences disappeared. On the basis of the nature of these intercoder differences, we concluded that fixation classification by experienced untrained human coders is not a gold standard. To bridge the gap between agreement measures (e.g., Cohen’s kappa) and eye movement parameters (fixation duration, number of fixations), we suggest the use of the event-based F1 score and two new measures: the relative timing offset (RTO) and the relative timing deviation (RTD).
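To make the sample-based measure above concrete, here is a minimal Python sketch of Cohen’s kappa computed over per-sample fixation labels. The two synthetic coder arrays are hypothetical; they only illustrate how coders whose event boundaries differ by a single sample can still reach “almost perfect” sample-based agreement.

```python
import numpy as np

def sample_cohens_kappa(labels_a, labels_b):
    """Cohen's kappa over per-sample labels (1 = fixation, 0 = not)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    p_o = np.mean(a == b)  # observed per-sample agreement
    # chance agreement expected from each coder's marginal label frequencies
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical coders: 24 fixations of ~20 samples each, with every
# fixation offset placed one sample later by coder B. The coders agree
# on 96% of samples and kappa is ~0.86 ("almost perfect" in Landis &
# Koch terms), even though no two fixation offsets are identical.
coder_a = np.array(([1] * 20 + [0] * 5) * 24)
coder_b = np.array(([1] * 21 + [0] * 4) * 24)
print(sample_cohens_kappa(coder_a, coder_b))
```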

Highlights

  • Background of the human coders: The group of coders is too small to perform statistics on the background of the coders in relation to their classifications.

  • To bridge the gap between agreement measures (e.g., Cohen’s kappa) and eye movement parameters, we suggest the use of the event-based F1 score and two new measures: the relative timing offset (RTO) and the relative timing deviation (RTD) (see the sketch after this list).

  • We will refer to this whole procedure as the strict gold standard approach. An example of this is found in Munn, Stefano, and Pelz (2008), who developed an algorithm to classify fixations produced during the viewing of a dynamic scene. They state the following about their algorithm: “In comparing the performance of this algorithm to results obtained by three experienced coders, the algorithm performed remarkably well.” In Zemblys, Niehorster, Komogortsev, and Holmqvist (2017) we found an interesting quote concerning human classification: “We did not have multiple coders to analyze interrater reliability, as this would open another research question of how the coder’s background and experience affect the events produced.” By using only one coder, Zemblys et al. can still apply the strict gold standard approach.
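The event-based measures in the highlight above are only named here, so the following Python sketch spells out one plausible reading: fixation events are extracted from per-sample labels, greedily matched by temporal overlap, F1 is computed over the matched events, and RTO/RTD are taken as the mean and standard deviation of the onset-time differences of matched events, in milliseconds. The matching rule, the onset-based timing, the 500-Hz default, and all function names are our assumptions for illustration, not the paper’s exact definitions.

```python
import numpy as np

def events_from_labels(labels):
    """Return (onset, offset) sample indices of fixation episodes."""
    padded = np.concatenate(([0], np.asarray(labels), [0]))
    d = np.diff(padded)
    onsets = np.flatnonzero(d == 1)
    offsets = np.flatnonzero(d == -1) - 1
    return list(zip(onsets, offsets))

def match_events(ref, test):
    """Greedily pair each reference event with the test event it overlaps most."""
    pairs, used = [], set()
    for r_on, r_off in ref:
        best, best_ovl = None, 0
        for j, (t_on, t_off) in enumerate(test):
            ovl = min(r_off, t_off) - max(r_on, t_on) + 1
            if j not in used and ovl > best_ovl:
                best, best_ovl = j, ovl
        if best is not None:
            used.add(best)
            pairs.append(((r_on, r_off), test[best]))
    return pairs

def event_f1(ref, test):
    """Event-based F1: matched events are hits, the rest misses/false alarms."""
    tp = len(match_events(ref, test))
    fp, fn = len(test) - tp, len(ref) - tp
    return 2 * tp / (2 * tp + fp + fn)

def rto_rtd(ref, test, fs=500.0):
    """Assumed reading: RTO = mean, RTD = SD of onset differences (ms)."""
    diffs = [1000.0 * (t_on - r_on) / fs
             for (r_on, _), (t_on, _) in match_events(ref, test)]
    return np.mean(diffs), np.std(diffs)

# Usage on the hypothetical coders from the kappa sketch above, but with
# coder B's onsets (rather than offsets) delayed by one sample:
ref = events_from_labels(([1] * 20 + [0] * 5) * 24)
test = events_from_labels(([0] + [1] * 20 + [0] * 4) * 24)
print(event_f1(ref, test))   # 1.0 -- every event finds a match
print(rto_rtd(ref, test))    # (2.0, 0.0) -- onsets lag by one 500-Hz sample
```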


Introduction

The group of coders is too small to perform statistics on the background of the coders in relation to their classifications. Their experience with eye tracking ranged from 2 to 24 years (Table 1). The number of years does not seem to be an important factor; the type of experience may matter more. All of the coders have experience with more than one eye tracker, and all of them have designed or implemented event classifiers, either for data analysis or for experiments with gaze-contingent displays. Importantly, they all have experience with both low-frequency and higher-frequency eye trackers, and they have processed data of both low and higher quality (in terms of RMS deviation and data loss).

