Abstract

Human action recognition is one of the most challenging tasks in computer vision. Most existing work on human action recognition is limited to single-label classification. A real-world video stream, however, often contains multiple human actions and is typically annotated collectively with a set of relevant action labels, which leads to a multi-label learning problem. Furthermore, while there is a great number of meaningful human actions in reality, it would be extremely difficult, if not impossible, to collect and annotate sufficient video clips for all of them to train a supervised learning model. In this paper, we formulate real-world human action recognition as a multi-label zero-shot learning problem and propose a joint latent ranking embedding framework to address it. Our framework holistically tackles the unknown temporal boundaries between different actions within a video clip for multi-label learning, and exploits side information on the semantic relationships between actions for zero-shot learning. Specifically, the framework consists of two component neural networks for visual and semantic embedding, respectively; multi-label zero-shot recognition is then performed by measuring the relatedness scores of candidate action labels to a test video clip in the joint latent visual and semantic embedding space. We evaluate our framework in different settings, including a novel data split scheme designed especially for evaluating multi-label zero-shot learning. Experimental results on two weakly annotated multi-label human action datasets (Breakfast and Charades) demonstrate the effectiveness of our framework.
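To make the joint embedding idea concrete, the sketch below shows one minimal way such a framework could score candidate labels: a visual network and a semantic network map video features and label word vectors into a shared latent space, and relatedness scores are computed there. The network sizes, the dot-product score, the threshold, and all names (VisualEmbedding, SemanticEmbedding, relatedness_scores) are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch, assuming precomputed video features (e.g. pooled CNN
# features) and label word vectors (e.g. word2vec). Hypothetical
# architecture choices; not the authors' exact model.
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Maps a video feature vector into the joint latent space."""
    def __init__(self, visual_dim=2048, latent_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class SemanticEmbedding(nn.Module):
    """Maps a label word vector into the same joint latent space."""
    def __init__(self, word_dim=300, latent_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(word_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, w):
        return self.net(w)

def relatedness_scores(video_feat, label_vecs, f_vis, f_sem):
    """Score every candidate label (seen or unseen) against one video clip."""
    v = f_vis(video_feat)   # (latent_dim,)
    s = f_sem(label_vecs)   # (num_labels, latent_dim)
    return s @ v            # (num_labels,) relatedness scores

# Usage: unseen actions need only word vectors, no training videos.
f_vis, f_sem = VisualEmbedding(), SemanticEmbedding()
video = torch.randn(2048)        # a test clip's visual feature
labels = torch.randn(10, 300)    # word vectors of 10 candidate actions
scores = relatedness_scores(video, labels, f_vis, f_sem)
predicted = (scores > 0.0).nonzero().flatten()  # multi-label prediction

Thresholding the ranked scores (rather than taking a single argmax) is what yields a multi-label output, and because unseen labels enter only through their word vectors, the same scoring step covers the zero-shot case.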
