Abstract

This paper introduces the pipeline used to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, and 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version (Damen et al., Scaling Egocentric Vision, ECCV 2018), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete (+128% more action segments) annotations of fine-grained actions. This collection enables new challenges such as action detection and evaluating the "test of time", i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task and provide baselines and evaluation metrics.

Highlights

  • We introduce four new challenges: weakly-supervised action recognition (Sect. 4.2), action detection (Sect. 4.3), unsupervised domain adaptation for action recognition (Sect. 4.5) and action retrieval (Sect. 4.6)

  • The action recognition task itself follows its definition in the previous version of the dataset (Damen et al. 2018)

Introduction and Related Datasets

Since the dawn of machine learning for computer vision, datasets have been curated to train models for single tasks, from classification (Deng et al. 2009; Carreira and Zisserman 2017) to detection (Lin et al. 2014; Gu et al. 2018), captioning (Karpathy and Fei-Fei 2015; Xu et al. 2016) and segmentation (Zhou et al. 2017; Perazzi et al. 2016). One dataset can be enriched with multiple annotations and tasks, aimed towards learning intermediate representations through downstream and multi-task learning on the same input. This has recently been achieved for autonomous driving (Zhou et al. 2019; Geiger et al. 2012; Cordts et al. 2016; Neuhold et al. 2017; Yu et al. 2018; Huang et al. 2018; Caesar et al. 2019; Yogamani et al. 2019) and scene understanding (Zamir et al. 2018; Silberman et al. 2012). Zamir et al. (2018) contains 26 tasks, ranging from edge detection to vanishing point estimation and scene classification.
