Abstract

A better understanding of human interactions in videos can be achieved by simultaneously considering the coarse interactions between people, the action of each individual, and the activity of all people as a whole. We divide the recognition task into two stages. The first stage discriminates interactions from non-interactions and recognizes actions and activities locally, based on local image information; in the second stage, actions and activities are recognized in a global manner based on these local recognition results. A conditional random field (CRF) is designed to model human interactions in the spatio-temporal space. Unlike most existing global models, which cover either action variables or activity variables only, our model covers both by considering the interactions between the two types of variables. The graph structure of the CRF is predicted by a model learned from training data, in contrast to traditional graph construction methods, which typically rely on human heuristics. We learn the parameters of the CRF via a structured support vector machine. We propose an efficient inference algorithm to handle label estimation in long videos containing many people. Our model enables semantic-level understanding of human interactions in videos while achieving competitive action and activity recognition performance.
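
As a rough illustration of the kind of model the abstract describes (a generic sketch, not the authors' exact formulation or notation), a spatio-temporal CRF over per-person action variables a_i and a group activity variable y, given video observations x, could be scored by an energy of the form

E(a, y \mid x; w) = \sum_i w_a^{\top} \phi_a(a_i, x) \;+\; w_y^{\top} \phi_y(y, x) \;+\; \sum_{(i,j) \in \mathcal{E}} w_{aa}^{\top} \phi_{aa}(a_i, a_j, x) \;+\; \sum_i w_{ay}^{\top} \phi_{ay}(a_i, y, x),

where the edge set \mathcal{E} would here be predicted by the learned graph-construction model rather than fixed by hand, the action-activity terms \phi_{ay} are what couple the two types of variables, and the weight vector w would be learned with a structured SVM. All symbols are illustrative placeholders rather than notation taken from the paper.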
