Abstract
In this paper, we propose spatio-temporal representation matching (STRM) for video-based action recognition under the open-set condition. Open-set action recognition is more challenging than closed-set action recognition because samples of untrained action classes must also be recognized, and most conventional frameworks are likely to give a false prediction for them. To handle untrained action classes, STRM jointly learns motion and appearance. It extracts spatio-temporal representations (ST-representations) from video clips through a joint learning pipeline that uses both motion and appearance information. STRM then computes the similarities between the ST-representations to find the one with the highest similarity. We defined an experimental protocol for open-set action recognition and carried out experiments on UCF101 and HMDB51 to evaluate STRM. We first investigated the effects of different hyper-parameter settings on STRM, and then compared its performance with existing state-of-the-art methods. The experimental results showed that the proposed method not only outperformed existing methods under the open-set condition, but also provided performance comparable to the state-of-the-art methods under the closed-set condition.
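The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measure STRM uses, so cosine similarity is assumed here, and the names `cosine_similarity`, `match_representation`, and `gallery`, as well as the toy 3-dimensional vectors, are hypothetical stand-ins for real ST-representations.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def match_representation(
    query: List[float], gallery: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the gallery label whose representation is most similar to the query."""
    if not gallery:
        raise ValueError("gallery must contain at least one representation")
    best_label, best_score = "", -math.inf
    for label, representation in gallery.items():
        score = cosine_similarity(query, representation)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical toy vectors; real ST-representations would come from the network.
gallery = {"wave": [1.0, 0.0, 0.0], "jump": [0.0, 1.0, 0.0]}
label, score = match_representation([0.9, 0.1, 0.0], gallery)
print(label, round(score, 3))
```

Because the comparison is made between representations rather than over a fixed set of classifier outputs, the reference set can in principle contain samples of action classes that were not used for training, which is the property the open-set setting relies on.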
Highlights
Action recognition is one of the most challenging aspects of computer vision research, because the complexity and variety of human behaviors make recognition difficult.
We propose a spatio-temporal representation (ST-representation) matching (STRM) method based on joint learning of motion and appearance.
Several recent methods, such as [34], [40], have focused on modeling long-range temporal structure by combining 2D convolution with Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM).
Summary
Y. Yoon et al.: STRM-Based Open-Set Action Recognition by Joint Learning of Motion and Appearance
Action recognition is one of the most challenging aspects of computer vision research, because the complexity and variety of human behaviors make recognition difficult. ... object detection [21], [22], image classification [23], [24], and pose estimation [25]–[28]. Action recognition is a difficult problem in itself because of the complexity and variability of human actions, and the open-set condition makes it even harder because it includes unconfined action categories, that is, action classes that were not seen during training. To resolve this issue, we propose a spatio-temporal representation (ST-representation) matching (STRM) method based on joint learning of motion and appearance. The open-set action recognition process using STRM is as follows: Initially, STRM extracts joint spatio-temporal representations (joint ST-representations) from a given video.