Abstract

Human action recognition is an increasingly important research topic in the fields of video sensing, analysis and understanding. Owing to unconstrained sensing conditions, realistic videos exhibit large intra-class variations and inter-class ambiguities, which hinder further improvement of recognition performance in recent vision-based action recognition systems. In this paper, we propose a generalized pyramid matching kernel (GPMK) for recognizing human actions in realistic videos, based on a multi-channel “bag of words” representation constructed from the local spatial-temporal features of video clips. As an extension of the spatial-temporal pyramid matching (STPM) kernel, the GPMK leverages heterogeneous visual cues across multiple feature descriptor types and spatial-temporal grid granularity levels to build a valid similarity metric between two video clips for kernel-based classification. Instead of the predefined, fixed weights used in STPM, we present a simple yet effective method that computes adaptive channel weights for the GPMK from the kernel target alignment on the training data, incorporating both prior knowledge and data-driven information about the different channels in a principled way. Experimental results on three challenging video datasets (Hollywood2, YouTube and HMDB51) validate the superiority of the GPMK over the traditional STPM kernel for realistic human action recognition, and our results outperform the state of the art in the literature.
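To make the construction concrete, below is a minimal sketch of the two ingredients the abstract describes: a GPMK computed as a weighted sum of per-channel base kernels (one channel per descriptor type and grid granularity level), and channel weights set by kernel target alignment on the training set. This is an illustration under assumptions, not the paper's implementation: all function names are hypothetical, the base kernel is assumed to be histogram intersection, labels are assumed binary, and the paper's actual weighting scheme also incorporates prior knowledge about the channels, which this sketch omits.

```python
import numpy as np

def channel_kernel(h_a, h_b):
    """Histogram-intersection base kernel between two L1-normalized
    bag-of-words histograms of one channel (one descriptor type at
    one spatial-temporal grid level). Assumed choice of base kernel."""
    return np.minimum(h_a, h_b).sum()

def gpmk(hists_a, hists_b, weights):
    """Generalized pyramid matching kernel as a weighted sum of
    per-channel base kernels. hists_a / hists_b are lists holding one
    histogram per channel; weights holds one non-negative weight per
    channel."""
    return sum(w * channel_kernel(ha, hb)
               for w, ha, hb in zip(weights, hists_a, hists_b))

def kta_channel_weights(channel_grams, y):
    """Data-driven channel weights from kernel target alignment:
    alignment(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F).
    channel_grams holds one training Gram matrix per channel and
    y is a vector of binary labels in {-1, +1}; channels whose base
    kernel agrees better with the label similarity get larger weights."""
    Y = np.outer(y, y)  # ideal target kernel built from the labels
    weights = []
    for K in channel_grams:
        a = (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))
        weights.append(max(a, 0.0))  # clip negative alignments to zero
    w = np.asarray(weights)
    return w / max(w.sum(), 1e-12)  # normalize the weights to sum to 1
```

The resulting `gpmk` values over all training pairs form a valid kernel matrix (a non-negative combination of positive semi-definite base kernels), so it can be passed directly to a kernel classifier such as an SVM with a precomputed kernel.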

Highlights

  • Recognition of human actions, e.g., running, fighting and shooting a ball, is an increasingly important research topic in the fields of video sensing, analysis and understanding [1,2,3]

  • Related work in the human action recognition literature can generally be divided into two categories: (1) the first relies on detecting and analyzing human body movement in video sequences and performs action recognition on that basis; (2) as an extension of a classic framework in the image classification field [23,24,25,26,27], the second aims at directly building a holistic feature representation of the video clip for human action recognition, based on local spatial-temporal features [28] and the “bag of words” representation

  • We propose a new matching kernel, called the generalized pyramid matching kernel (GPMK), that measures the similarity between two video clips by leveraging heterogeneous visual cues across multiple feature descriptor types and spatial-temporal grid granularity levels



Introduction

Recognition of human actions, e.g., running, fighting and shooting a ball, is an increasingly important research topic in the fields of video sensing, analysis and understanding [1,2,3]. Although promising progress has been achieved for human action recognition in constrained scenarios [12,13], recognition accuracy remains unsatisfactory for realistic videos (e.g., TV, movies and Internet videos) [14,15,16]. This is mainly because such videos are taken under unconstrained sensing conditions and suffer from a great number of visual challenges (e.g., object pose, background clutter, camera motion, viewpoint and illumination variations), which result in large intra-class variations and inter-class ambiguities that hinder the improvement of recognition performance in recent vision-based action recognition systems. Related work in the human action recognition literature can generally be divided into two categories: (1) the first category relies on technologies for detecting and analyzing human body movement (e.g., kinematic tracking [21], human body pose estimation [22], space-time shape templates [13], etc.) in video sequences and performs action recognition on that basis; (2) as an extension of a classic framework in the image classification field [23,24,25,26,27], the second category aims at directly building a holistic feature representation of the video clip for human action recognition, based on local spatial-temporal features [28] and the “bag of words” representation, as sketched below.
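As a rough illustration of that second-category pipeline, the sketch below quantizes local spatial-temporal descriptors into a visual vocabulary and accumulates a per-clip “bag of words” histogram. The descriptor extraction step itself (e.g., HOG/HOF descriptors around detected interest points) is assumed to have been done already, the vocabulary size is an arbitrary placeholder, and the function names are illustrative rather than from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=4000, seed=0):
    """Learn a visual vocabulary by clustering local spatial-temporal
    descriptors (one row per descriptor) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed).fit(train_descriptors)

def bow_histogram(codebook, clip_descriptors):
    """Holistic clip representation: assign each local descriptor of
    the clip to its nearest visual word and accumulate an
    L1-normalized word-occurrence histogram."""
    words = codebook.predict(clip_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # guard against empty clips
```

In the multi-channel setting described in the abstract, one such histogram is computed per descriptor type and per cell of each spatial-temporal grid level, and the resulting per-channel histograms are what the GPMK compares.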
