Multi-Scale Locality-Constrained Spatiotemporal Coding for Local Feature Based Human Action Recognition

Bin Wang, Wei Xu, Yu Liu, Maojun Zhang, Wei Wang

doi:10.1155/2013/405645

Abstract

We propose a Multiscale Locality-Constrained Spatiotemporal Coding (MLSC) method to improve on the traditional bag of features (BoF) algorithm, which ignores the spatiotemporal relationships among local features when recognizing human actions in video. To model these relationships, MLSC incorporates the spatiotemporal position of each local feature into the feature coding process. It projects local features into a sub space-time-volume (sub-STV) and encodes them with locality-constrained linear coding. The group of sub-STV features obtained from one video with MLSC and max-pooling is used to classify that video. In the classification stage, Locality-Constrained Group Sparse Representation (LGSR) is adopted to exploit the intrinsic group information of these sub-STV features. Experimental results on the KTH, Weizmann, and UCF sports datasets show that our method achieves better performance than competing local spatiotemporal feature-based human action recognition methods.

Highlights

Human action recognition in video has been widely studied over the last decade due to its broad application prospects in areas such as video surveillance [1, 2], action-based human-computer interfaces [3], and video content analysis [4].
We propose a novel framework that does not require estimating action cycles: (1) it densely samples several sub space-time-volumes (sub-STVs) from one video; (2) it carries out Multiscale Locality-Constrained Spatiotemporal Coding (MLSC) in each sub-STV to obtain a sub-STV descriptor; and (3) it classifies the action from these sub-STV descriptors with Locality-Constrained Group Sparse Representation (LGSR).
Bag of features (BoF)-based action representation methods using the existing local feature coding methods vector quantization (VQ), sparse coding (SC), and Locality-constrained Linear Coding (LLC) [29], as well as our MLSC, are further compared under the same condition, namely that a K-nearest neighbor (KNN) classifier is used in the classification stage.
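
To make the difference between two of the coding methods named above concrete, the following is a minimal sketch that contrasts hard-assignment VQ with the commonly used approximated form of LLC (reconstruction from the k nearest code words with weights that sum to one). It is not the authors' implementation: the random codebook, the feature dimensions, and the parameters `k` and `reg` are assumptions chosen only for illustration.

```python
import numpy as np

def vq_code(x, codebook):
    """Vector quantization: one-hot code on the single nearest code word."""
    dists = np.linalg.norm(codebook - x, axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(dists)] = 1.0
    return code

def llc_code(x, codebook, k=5, reg=1e-4):
    """Approximated LLC: reconstruct x from its k nearest code words,
    with the weights constrained to sum to one."""
    dists = np.linalg.norm(codebook - x, axis=1)
    nearest = np.argsort(dists)[:k]
    z = codebook[nearest] - x                 # neighbours shifted so x is the origin
    cov = z @ z.T                             # local covariance, shape (k, k)
    cov += reg * max(np.trace(cov), 1e-12) * np.eye(k)  # regularise for stability
    w = np.linalg.solve(cov, np.ones(k))
    w /= w.sum()                              # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[nearest] = w
    return code

# Hypothetical data: a random codebook and random local descriptors.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))
features = rng.normal(size=(100, 16))

vq_err = np.mean([np.linalg.norm(x - vq_code(x, codebook) @ codebook)
                  for x in features])
llc_err = np.mean([np.linalg.norm(x - llc_code(x, codebook) @ codebook)
                   for x in features])
print(f"mean reconstruction error  VQ: {vq_err:.3f}  LLC: {llc_err:.3f}")
```

Because the one-hot VQ code is itself a feasible LLC solution, the LLC reconstruction error should come out lower on data like this, which is the sense in which locality-constrained coding reduces representation error.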

Summary

Introduction

Human action recognition in video has been widely studied over the last decade due to its broad application prospects in areas such as video surveillance [1, 2], action-based human-computer interfaces [3], and video content analysis [4]. Locality-Constrained Group Sparse Representation [38] is adopted for action classification on the sub-STV descriptors. Compared with methods that use spatiotemporal context information [34, 35] or feature distribution [36] to address the limitations of BoF, MLSC is finer-grained and more complete, because it records all the elements (where, when, who, and how) of local features for human action recognition. To overcome the limitations of BoF, a novel feature coding method, MLSC, is proposed that models the spatiotemporal relationships of local features while also reducing representation error.
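
One way to picture how position can enter the coding step is sketched below. This summary does not give the MLSC objective, so the sketch rests on assumptions: each local descriptor is concatenated with its (x, y, t) position expressed in sub-STV coordinates, the augmented vector is coded with approximated LLC, and the codes are max-pooled into one sub-STV descriptor. The weight `alpha`, the random codebook, and the function names are hypothetical, and the multiscale aspect is omitted.

```python
import numpy as np

def llc_codes(X, codebook, k=5, reg=1e-4):
    """Approximated LLC codes for the rows of X, shape (n, n_codewords)."""
    codes = np.zeros((len(X), len(codebook)))
    for i, x in enumerate(X):
        nearest = np.argsort(np.linalg.norm(codebook - x, axis=1))[:k]
        z = codebook[nearest] - x
        cov = z @ z.T
        cov += reg * max(np.trace(cov), 1e-12) * np.eye(k)
        w = np.linalg.solve(cov, np.ones(k))
        codes[i, nearest] = w / w.sum()
    return codes

def sub_stv_descriptor(descs, positions, origin, size, codebook, alpha=1.0):
    """Code the features that fall inside one sub-STV and max-pool them.

    descs:     (n, d) local descriptors
    positions: (n, 3) feature positions as (x, y, t)
    origin, size: corner and extent of the sub-STV, each of shape (3,)
    codebook:  (m, d + 3) code words over descriptor plus position (assumed)
    """
    rel = (positions - origin) / size            # position in sub-STV coordinates
    inside = np.all((rel >= 0) & (rel < 1), axis=1)
    if not inside.any():
        return np.zeros(len(codebook))           # empty sub-STV
    augmented = np.hstack([descs[inside], alpha * rel[inside]])
    return llc_codes(augmented, codebook).max(axis=0)   # max-pooling

# Hypothetical data; in practice the codebook would be learned, e.g. by k-means.
rng = np.random.default_rng(0)
descs = rng.normal(size=(200, 16))
positions = rng.uniform([0, 0, 0], [160, 120, 100], size=(200, 3))
codebook = rng.normal(size=(64, 19))

descriptor = sub_stv_descriptor(descs, positions,
                                origin=np.array([40.0, 30.0, 20.0]),
                                size=np.array([80.0, 60.0, 40.0]),
                                codebook=codebook)
print(descriptor.shape)  # (64,)
```

Sampling many such sub-STVs from one video would yield the group of descriptors that the classification stage operates on.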

Multiscale Locality-Constrained Spatiotemporal Coding

Human Action Recognition with

Experiment and Analysis

Methods

Conclusion