Direct observation (DO) is a widely accepted ground-truth measure of physical behavior, but the field lacks standard operational definitions. Research groups develop project-specific annotation platforms, which limits the utility of DO when labels are not consistent across studies. Purpose: To evaluate within- and between-site agreement for DO taxonomies (e.g., activity intensity category) across four independent research groups that have used video-recorded DO. Methods: Each site contributed video files (508 min total) and had two trained research assistants annotate the shared video files according to the site's existing annotation protocol. The authors calculated (a) within-site agreement, expressed as the intraclass correlation between the two coders at the same site, and (b) between-site agreement, the proportion of seconds on which any two coders agreed, regardless of site. Results: Within-site agreement at all sites was good to excellent for both activity intensity category (intraclass correlation range: .82–.90) and posture/whole-body movement (intraclass correlation range: .77–.98). Between-site agreement for intensity categories was 94.6% for sedentary, 80.9% for light, and 82.8% for moderate–vigorous. Three of the four sites used common labels for eight posture/whole-body movements; for these labels, within-site agreement was 94.5% and between-site agreement was 86.1%. Conclusions: Distinct research groups can annotate key features of physical behavior with good-to-excellent interrater reliability. Operational definitions are provided for core metrics for researchers to consider in future studies, facilitating between-study comparisons and data pooling and enabling the deployment of deep learning approaches to wearable device algorithm calibration.
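As a minimal illustration of the two agreement metrics named in the Methods, the sketch below computes (a) an intraclass correlation for two coders' per-second codes, using ICC(2,1) as one common model (the abstract does not state which ICC form the sites reported), and (b) the proportion of seconds on which two coders agree. The function names, the numeric coding of intensity categories, and the ICC model choice are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def percent_agreement(codes_a, codes_b):
    """Between-site style metric: proportion of seconds on which two coders
    assigned the same label (hypothetical helper, not the authors' code)."""
    a, b = np.asarray(codes_a), np.asarray(codes_b)
    return float(np.mean(a == b))

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single rater, absolute agreement,
    for an (n_segments x n_raters) matrix of numeric codes. This model is an
    assumption; the abstract does not specify which ICC form was used."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between-segment mean square
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between-rater mean square
    ss_total = np.sum((x - grand) ** 2)
    ms_e = (ss_total - (n - 1) * ms_r - (k - 1) * ms_c) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Toy per-second intensity codes (1 = sedentary, 2 = light, 3 = moderate-vigorous)
coder1 = [1, 1, 2, 3, 3, 2, 1, 1]
coder2 = [1, 1, 2, 3, 2, 2, 1, 1]
print(percent_agreement(coder1, coder2))                # 0.875
print(icc_2_1(np.column_stack([coder1, coder2])))
```

In practice, the between-site percentage agreement would be averaged over all coder pairs drawn from different sites, and the ICC computed separately within each site; the toy data above stand in for the per-second annotation streams described in the abstract.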