Abstract

Human action recognition is an increasingly important research topic in the fields of video sensing, analysis and understanding. Owing to unconstrained sensing conditions, realistic videos exhibit large intra-class variations and inter-class ambiguities, which hinder further improvement of recognition performance in recent vision-based action recognition systems. In this paper, we propose a generalized pyramid matching kernel (GPMK) for recognizing human actions in realistic videos, based on a multi-channel “bag of words” representation constructed from the local spatial-temporal features of video clips. As an extension of the spatial-temporal pyramid matching (STPM) kernel, the GPMK leverages heterogeneous visual cues across multiple feature descriptor types and spatial-temporal grid granularity levels to build a valid similarity metric between two video clips for kernel-based classification. Instead of the predefined, fixed weights used in STPM, we present a simple yet effective method that computes adaptive channel weights for the GPMK from the kernel target alignment on the training data, incorporating both prior knowledge and data-driven information about the different channels in a principled way. Experimental results on three challenging video datasets (Hollywood2, YouTube and HMDB51) validate the superiority of the GPMK over the traditional STPM kernel for realistic human action recognition, and our results outperform the state of the art in the literature.
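To make the construction concrete, below is a minimal sketch of the two ingredients the abstract describes: a GPMK computed as a weighted sum of per-channel base kernels (one channel per descriptor type and grid granularity level), and channel weights set by kernel target alignment on the training set. This is an illustration under assumptions, not the paper's implementation: all function names are hypothetical, the base kernel is assumed to be histogram intersection, labels are assumed binary, and the paper's actual weighting scheme also incorporates prior knowledge about the channels, which this sketch omits.

```python
import numpy as np

def channel_kernel(h_a, h_b):
    """Histogram-intersection base kernel between two L1-normalized
    bag-of-words histograms of one channel (one descriptor type at
    one spatial-temporal grid level). Assumed choice of base kernel."""
    return np.minimum(h_a, h_b).sum()

def gpmk(hists_a, hists_b, weights):
    """Generalized pyramid matching kernel as a weighted sum of
    per-channel base kernels. hists_a / hists_b are lists holding one
    histogram per channel; weights holds one non-negative weight per
    channel."""
    return sum(w * channel_kernel(ha, hb)
               for w, ha, hb in zip(weights, hists_a, hists_b))

def kta_channel_weights(channel_grams, y):
    """Data-driven channel weights from kernel target alignment:
    alignment(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F).
    channel_grams holds one training Gram matrix per channel and
    y is a vector of binary labels in {-1, +1}; channels whose base
    kernel agrees better with the label similarity get larger weights."""
    Y = np.outer(y, y)  # ideal target kernel built from the labels
    weights = []
    for K in channel_grams:
        a = (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))
        weights.append(max(a, 0.0))  # clip negative alignments to zero
    w = np.asarray(weights)
    return w / max(w.sum(), 1e-12)  # normalize the weights to sum to 1
```

The resulting `gpmk` values over all training pairs form a valid kernel matrix (a non-negative combination of positive semi-definite base kernels), so it can be passed directly to a kernel classifier such as an SVM with a precomputed kernel.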

Highlights

  • Recognition of human actions, e.g., running, fighting and shooting a ball, is an increasingly important research topic in the fields of video sensing, analysis and understanding [1,2,3]

  • Related work in the human action recognition literature can generally be divided into two categories: (1) the first relies on detecting and analyzing human body movement in video sequences and performs action recognition on that basis; (2) as an extension of a classic framework in the image classification field [23,24,25,26,27], the second aims at directly building a holistic feature representation of the video clip for human action recognition, based on local spatial-temporal features [28] and the “bag of words” representation

  • We propose a new matching kernel, called the generalized pyramid matching kernel (GPMK), that measures the similarity between two video clips by leveraging heterogeneous visual cues across multiple feature descriptor types and spatial-temporal grid granularity levels



Introduction

Recognition of human actions, e.g., running, fighting and shooting a ball, is an increasingly important research topic in the fields of video sensing, analysis and understanding [1,2,3]. Although promising progress has been achieved for human action recognition in constrained scenarios [12,13], recognition accuracy remains unsatisfactory for realistic videos (e.g., TV, movies and Internet videos) [14,15,16]. This is mainly because such videos are taken under unconstrained sensing conditions and suffer from a great number of visual challenges (e.g., object pose, background clutter, camera motion, viewpoint and illumination variations), which result in large intra-class variations and inter-class ambiguities that hinder the improvement of recognition performance in recent vision-based action recognition systems. Related work in the human action recognition literature can generally be divided into two categories: (1) the first category relies on technologies for detecting and analyzing human body movement (e.g., kinematic tracking [21], human body pose estimation [22], space-time shape templates [13], etc.) in video sequences and performs action recognition on that basis; (2) as an extension of a classic framework in the image classification field [23,24,25,26,27], the second category aims at directly building a holistic feature representation of the video clip for human action recognition, based on local spatial-temporal features [28] and the “bag of words” representation, as sketched below.
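As a rough illustration of that second-category pipeline, the sketch below quantizes local spatial-temporal descriptors into a visual vocabulary and accumulates a per-clip “bag of words” histogram. The descriptor extraction step itself (e.g., HOG/HOF descriptors around detected interest points) is assumed to have been done already, the vocabulary size is an arbitrary placeholder, and the function names are illustrative rather than from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=4000, seed=0):
    """Learn a visual vocabulary by clustering local spatial-temporal
    descriptors (one row per descriptor) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed).fit(train_descriptors)

def bow_histogram(codebook, clip_descriptors):
    """Holistic clip representation: assign each local descriptor of
    the clip to its nearest visual word and accumulate an
    L1-normalized word-occurrence histogram."""
    words = codebook.predict(clip_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # guard against empty clips
```

In the multi-channel setting described in the abstract, one such histogram is computed per descriptor type and per cell of each spatial-temporal grid level, and the resulting per-channel histograms are what the GPMK compares.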
