Abstract

In this paper, we propose spatio-temporal representation matching (STRM) for video-based action recognition under the open-set condition. Open-set action recognition is more challenging than closed-set action recognition because samples of untrained action classes must also be recognized, and most conventional frameworks are likely to give false predictions for them. To handle untrained action classes, STRM extracts spatio-temporal representations (ST-representations) from video clips through a joint learning pipeline that uses both motion and appearance information. STRM then computes the similarities between the ST-representations to find the one with the highest similarity. We set up an experimental protocol for open-set action recognition and carried out experiments on UCF101 and HMDB51 to evaluate STRM. We first investigated the effects of different hyper-parameter settings on STRM and then compared its performance with existing state-of-the-art methods. The experimental results showed that the proposed method not only outperformed existing methods under the open-set condition but also provided performance comparable to the state-of-the-art methods under the closed-set condition.

Highlights

  • Action recognition is one of the most challenging aspects of computer vision research, because the complexity and variety of human behaviors make recognition difficult

  • We propose a spatio-temporal representation (ST-representation) matching (STRM) method based on joint learning of motion and appearance

  • Several recent methods have focused on modeling long-range temporal structure using a combination of 2D convolution with Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks, such as [34], [40]


Summary

INTRODUCTION

Y. Yoon et al.: STRM-Based Open-Set Action Recognition by Joint Learning of Motion and Appearance

Action recognition is one of the most challenging aspects of computer vision research, because the complexity and variety of human behaviors make recognition difficult. [...] object detection [21], [22], image classification [23], [24], and pose estimation [25]–[28]. Action recognition is already a difficult problem in itself because of the complexity and variability of human actions, and the open-set condition makes it even harder because it contains unconfined action categories. To resolve this issue, we propose a spatio-temporal representation (ST-representation) matching (STRM) method based on joint learning of motion and appearance. The open-set action recognition process using STRM is as follows: Initially, STRM extracts joint spatio-temporal representations (joint ST-representations) from a given video.
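The matching step described in the abstract (computing similarities between ST-representations and selecting the one with the highest similarity) can be illustrated with a minimal sketch. This excerpt does not state which similarity measure STRM uses or how the reference representations are organized, so the cosine similarity, the `gallery` dictionary, and the function names below are assumptions made purely for illustration. The learned extractor is not shown; representations are stood in for by plain lists of floats.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors.

    Assumption: cosine similarity is used only as an example metric.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def match_action(
    query: Sequence[float],
    gallery: Dict[str, List[Sequence[float]]],
) -> Tuple[str, float]:
    """Return the action class whose reference representation is most
    similar to the query, together with that similarity.

    `gallery` is a hypothetical mapping from an action class name to the
    ST-representations of its reference clips.
    """
    best_label = None
    best_score = -math.inf
    for label, references in gallery.items():
        for reference in references:
            score = cosine_similarity(query, reference)
            if score > best_score:
                best_label, best_score = label, score
    if best_label is None:
        raise ValueError("gallery contains no reference representations")
    return best_label, best_score


# Toy 3-dimensional vectors standing in for extracted ST-representations.
gallery = {
    "archery": [[1.0, 0.0, 0.0]],
    "bowling": [[0.0, 1.0, 0.0]],
}
label, score = match_action([0.9, 0.1, 0.0], gallery)
print(label, round(score, 3))
```

Because the prediction is made by comparison against reference representations and not by a fixed classification layer, a class that was not seen during training can in principle be added by placing its reference representations in the gallery. Whether STRM organizes its references this way is not stated in this excerpt.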

RELATED WORKS
JOINT SPATIO-TEMPORAL REPRESENTATION EXTRACTION
11: Select the action class to which the highest similarity s_i belongs
LEARNING STRM
EXPERIMENTS
EXPERIMENTAL SETTING AND DATASET
Findings
CONCLUSION
