Abstract

Action Recognition (AR) faces a scalability problem: collecting and annotating data for the ever-growing set of action categories is exhausting and impractical. As an alternative, Zero-Shot Action Recognition (ZSAR) is attracting increasing attention in the community, as it can exploit a shared semantic/attribute space to recognize novel categories without annotated data. Unlike AR, which focuses on learning the correlation between actions, ZSAR must consider the correlations of action-action, label-label, and action-label simultaneously. However, to the best of our knowledge, no existing work provides structural guidance for the framework design of ZSAR according to these task characteristics. In this paper, we demonstrate the rationality of using the Energy-Based Model (EBM) to guide the framework design of ZSAR based on its inference mechanism. Furthermore, under the guidance of the EBM, we propose an Energy-based Temporal Summarized Attentive Network (ETSAN) for ZSAR. Specifically, to ensure effective cross-modal matching, the EBM must capture the correlations of input-input, output-output, and input-output over discriminative and focused input and output spaces. To this end, we first design a Temporal Summarized Attentive Mechanism (TSAM) that captures the action-action correlation by constructing a discriminative and focused input space. Then, a Label Semantic Adaptive Mechanism (LSAM) is proposed to learn the label-label correlation by adjusting the semantic structure according to the target task. Finally, we devise an Energy Score Estimation Mechanism (ESEM) to measure the compatibility (i.e., energy score) between the video representation and the label semantic embedding. With end-to-end training, our framework captures all three correlations simultaneously by minimizing the energy score of the correct action-label pair.
Experiments on the HMDB51 and UCF101 datasets show that the proposed architecture achieves results comparable to other methods based on sequence-level spatial-temporal visual features, demonstrating the effectiveness of the EBM in guiding the framework design of ZSAR. Our code is available at https://github.com/oOHCIOo/ETSAN.
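The abstract's core idea, scoring the compatibility between a video representation and a label semantic embedding and predicting the label with the lowest energy, can be illustrated with a generic energy-based sketch. The paper's actual ESEM, TSAM, and LSAM are not specified here, so the bilinear compatibility form, the dimensions, and all function names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for the video-feature and label-embedding spaces.
D_VIDEO, D_LABEL, N_LABELS = 8, 6, 5

# A generic bilinear compatibility model: energy(v, s) = -v^T W s.
# Lower energy means the video-label pair is more compatible.
W = rng.normal(scale=0.1, size=(D_VIDEO, D_LABEL))

def energy(video_feat, label_emb, W):
    """Energy score of one video-label pair (lower = better match)."""
    return -float(video_feat @ W @ label_emb)

def predict(video_feat, label_embs, W):
    """Zero-shot inference: pick the (possibly unseen) label with minimal energy."""
    scores = np.array([energy(video_feat, s, W) for s in label_embs])
    return int(np.argmin(scores))

# Unseen-category label embeddings (e.g., from a shared semantic/attribute space).
label_embs = rng.normal(size=(N_LABELS, D_LABEL))
video = rng.normal(size=D_VIDEO)
print(predict(video, label_embs, W))
```

Training such a model end to end would push the energy of each correct action-label pair below the energies of all incorrect pairings, which matches the minimization objective described in the abstract.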
