Abstract

Zero-shot action recognition remains challenging. Current approaches use the names or classification scores of detected objects to model object relations in images, so recognition performance relies heavily on the accuracy of object classification. Humans, in contrast, can infer unseen action categories from visual knowledge such as motion patterns and object relations. In this work, a novel model is proposed for zero-shot action recognition that jointly captures object relations within a static frame and models temporal motion patterns across adjacent frames. Specifically, an object detector first detects objects and extracts their features. Graph convolutions are then applied to effectively leverage the relations among objects. Meanwhile, three-dimensional convolutional neural networks are adopted to model temporal information. Finally, the two outputs are fed separately into visual-to-semantic modules that project the visual features into the semantic space. Moreover, a prior knowledge learning method is devised to introduce visual commonsense knowledge with the help of an external dataset. Extensive experiments on three benchmark datasets, Olympic Sports, HMDB51, and UCF101, demonstrate the superiority of the proposed model over state-of-the-art methods.
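
To make the two-branch pipeline concrete, the following is a minimal sketch of the overall idea in PyTorch. It is not the authors' implementation: the layer sizes, the specific graph-convolution formulation, the tiny 3D-CNN stand-in, and the use of cosine similarity against class word embeddings are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumed PyTorch realization; sizes and design choices are
# illustrative, not taken from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectRelationGCN(nn.Module):
    """One graph-convolution layer over detected object features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, obj_feats, adj):
        # obj_feats: (num_objects, in_dim), adj: (num_objects, num_objects)
        # Symmetric normalization: A_hat = D^-1/2 (A + I) D^-1/2
        adj = adj + torch.eye(adj.size(0))
        deg_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)
        adj_norm = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        h = F.relu(self.proj(adj_norm @ obj_feats))
        return h.mean(dim=0)  # pooled object-relation feature


class MotionBranch(nn.Module):
    """Tiny 3D-CNN stand-in for the temporal (motion) stream."""
    def __init__(self, out_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, out_dim)

    def forward(self, clip):
        # clip: (batch, 3, frames, height, width)
        return self.fc(self.conv(clip).flatten(1))


class VisualToSemantic(nn.Module):
    """Projects a visual feature into the word-embedding (semantic) space."""
    def __init__(self, in_dim, sem_dim=300):
        super().__init__()
        self.fc = nn.Linear(in_dim, sem_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)


def zero_shot_scores(relation_feat, motion_feat, class_embeddings,
                     rel_head, mot_head):
    """Cosine similarity between projected visual features and unseen-class embeddings."""
    sem = rel_head(relation_feat) + mot_head(motion_feat)
    return F.normalize(sem, dim=-1) @ F.normalize(class_embeddings, dim=-1).T


if __name__ == "__main__":
    objs = torch.randn(5, 1024)              # features of 5 detected objects
    adj = torch.rand(5, 5)                   # pairwise object-relation weights
    clip = torch.randn(1, 3, 16, 112, 112)   # one 16-frame RGB clip
    classes = torch.randn(10, 300)           # word embeddings of 10 unseen classes

    gcn, motion = ObjectRelationGCN(1024, 512), MotionBranch(512)
    rel_head, mot_head = VisualToSemantic(512), VisualToSemantic(512)
    scores = zero_shot_scores(gcn(objs, adj), motion(clip).squeeze(0),
                              classes, rel_head, mot_head)
    print(scores.shape)  # torch.Size([10]); predicted unseen class = argmax
```

In this sketch, recognition of an unseen category reduces to nearest-neighbor matching in the semantic space: both branches are projected to the word-embedding dimension and compared against the embeddings of the unseen class names.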
