Abstract

Most previous work on first-person video recognition measures the similarity between actions using low-level features of the objects with which people interact. However, due to noisy camera motion and frequent changes in viewpoint and scale, such methods fail to capture and model highly discriminative object features. In this paper, we propose a novel pipeline for first-person daily activity recognition. Our object feature extraction pipeline is inspired by the recent success of object hypotheses and deep convolutional neural network (CNN)-based detection frameworks. Our key contribution is a simple yet effective manipulated-object proposal generation scheme. The scheme leverages motion cues, such as motion boundaries and motion magnitude (whereas most previous methods treat camera motion as "noise"), to generate a more compact and discriminative set of object proposals that are closely tied to the objects being manipulated. We then learn more discriminative object detectors from these proposals using a region-based CNN. In addition, we develop a non-linear feature fusion scheme that better combines object and motion features. Experiments show that the proposed framework significantly outperforms the state of the art on a challenging first-person daily activity benchmark.
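
The paper does not include code, but the proposal-scoring idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes dense optical flow is already computed (e.g., with OpenCV's Farnebäck estimator) and uses box-level flow magnitude, normalized by the frame-wide average to discount global camera motion, as a crude stand-in for the motion-boundary and motion-magnitude cues described above. All function names and the normalization choice are hypothetical.

```python
import numpy as np

def motion_magnitude(flow):
    """Per-pixel optical-flow magnitude; flow is an H x W x 2 array of (dx, dy)."""
    return np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)

def score_proposals(flow, proposals):
    """Score each candidate box by the mean flow magnitude it encloses,
    normalized by the frame-wide mean so that globally moving (camera-induced)
    regions score near 1 and strongly manipulated regions score higher.
    proposals: list of (x1, y1, x2, y2) pixel boxes, assumed valid and non-empty."""
    mag = motion_magnitude(flow)
    frame_mean = mag.mean() + 1e-6  # avoid division by zero on static frames
    scores = [mag[y1:y2, x1:x2].mean() / frame_mean
              for (x1, y1, x2, y2) in proposals]
    return np.asarray(scores)

def select_manipulated_proposals(flow, proposals, top_k=100):
    """Keep the top_k boxes most likely to cover a manipulated object."""
    scores = score_proposals(flow, proposals)
    order = np.argsort(scores)[::-1][:top_k]
    return [proposals[i] for i in order], scores[order]
```

In practice the candidate boxes would come from a generic proposal generator (e.g., selective search), and the surviving top-scoring boxes would then be fed to the region-based CNN detector described above.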
