Abstract

Our goal is to predict the camera wearer's location and pose in their environment from the video captured by their first-person wearable camera. Toward this goal, we first collect a new dataset in which the camera wearer performs various activities (e.g., opening a fridge, reading a book) in different scenes, recorded with time-synchronized first-person and stationary third-person cameras. We then propose a novel deep network architecture that takes as input the first-person video frames and an empty third-person scene image (without the camera wearer) and predicts the location and pose of the camera wearer in that scene. We compare our approach with several intuitive baselines and show promising initial results on this novel, challenging problem.
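To make the two-input design concrete, below is a minimal, hypothetical PyTorch sketch of such a network: one branch encodes the first-person frames, a second branch encodes the empty third-person scene image, and the fused features regress the wearer's location and pose. The class name, layer sizes, concatenation-based fusion, and the 2-D location/keypoint output format are all illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class EgoToThirdPersonNet(nn.Module):
    """Hypothetical two-branch model: first-person frames + empty scene image
    -> camera wearer's location and pose in the third-person view."""

    def __init__(self, num_frames=8, num_joints=17):
        super().__init__()
        # Encoder for stacked first-person RGB frames: input (B, 3*num_frames, H, W)
        self.ego_encoder = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Encoder for the empty third-person scene image: input (B, 3, H, W)
        self.scene_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        fused_dim = 128 + 128
        # Heads: 2-D location in the scene image and 2-D keypoints for the pose
        self.location_head = nn.Linear(fused_dim, 2)
        self.pose_head = nn.Linear(fused_dim, num_joints * 2)

    def forward(self, ego_frames, scene_image):
        ego_feat = self.ego_encoder(ego_frames).flatten(1)
        scene_feat = self.scene_encoder(scene_image).flatten(1)
        fused = torch.cat([ego_feat, scene_feat], dim=1)
        return self.location_head(fused), self.pose_head(fused)

# Usage with random tensors standing in for real data.
model = EgoToThirdPersonNet()
ego = torch.randn(2, 3 * 8, 224, 224)   # stacked first-person frames
scene = torch.randn(2, 3, 224, 224)     # empty third-person scene images
location, pose = model(ego, scene)
print(location.shape, pose.shape)       # torch.Size([2, 2]) torch.Size([2, 34])
```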
