Abstract

Zero-shot learning (ZSL) typically exploits a shared semantic space to recognize novel categories for which no labeled training data are available. However, traditional ZSL methods suffer from a serious domain shift problem in human action recognition, for three reasons: 1) existing ZSL methods are designed for object recognition in static images and do not capture the temporal dynamics of video sequences, so they perform poorly when applied directly to zero-shot action recognition; 2) these methods blindly project the target data into a shared space using a semantic mapping learned from the source data without any adaptation, ignoring the underlying structure of the target data; and 3) severe inter-class variations exist across action categories, yet traditional ZSL methods do not take the relationships among different categories into consideration. In this paper, we propose a novel aligned dynamic-preserving embedding (ADPE) model for zero-shot action recognition in a transductive setting. In our model, an adaptive embedding of target videos is learned by exploring the distributions of both the source and target data. An aligned regularization is further proposed to couple the centers of the target semantic representations with their corresponding label prototypes, preserving the relationships across different categories. Most importantly, the embedding simultaneously preserves the temporal dynamics of video sequences by exploiting their temporal consistency and capturing the temporal evolution of successive action segments. Our model can thus effectively overcome the domain shift problem in zero-shot action recognition. Experiments on the Olympic Sports, HMDB51, and UCF101 datasets demonstrate its effectiveness.
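
The abstract does not give the formulation of the aligned regularization, but as an illustrative sketch only, a regularizer of the kind described (coupling the per-class center of the embedded target videos with its label prototype) could take a form such as the following, where all symbols are hypothetical: $W$ denotes the learned embedding, $\phi(x)$ the visual feature of target video $x$, $\hat{\mathcal{T}}_c$ the set of target videos currently assigned to unseen class $c$, $a_c$ the semantic label prototype of class $c$, and $C_u$ the number of unseen classes.

\[
\mathcal{R}_{\mathrm{align}} \;=\; \sum_{c=1}^{C_u}
\Bigl\| \tfrac{1}{|\hat{\mathcal{T}}_c|} \sum_{x \in \hat{\mathcal{T}}_c} W^{\top}\phi(x) \;-\; a_c \Bigr\|_2^2
\]

Minimizing such a term pulls the center of each unseen class's embedded target videos toward its label prototype; the exact objective used by ADPE is specified in the full paper.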
