Abstract

Recognizing human actions recorded in a multi-camera environment faces the challenging issue of viewpoint variation. Multi-view methods employ videos from different views to generate a compact, view-invariant representation of human actions. This paper proposes a novel multi-view human action recognition approach that uses multiple low-dimensional temporal templates and a reconstruction-based encoding scheme. The approach extracts multiple 2D motion history images (MHIs) of human action videos over non-overlapping temporal windows, constructing multiple batches of motion history images (MB-MHIs). Two kinds of descriptions are then computed for these MHI batches, based on (1) a deep residual network (ResNet) and (2) histograms of oriented gradients (HOG), to effectively quantify changes in gradient. ResNet descriptions are average-pooled at each batch. HOG descriptions are processed independently at each batch to learn a class-based dictionary using the K-singular value decomposition (K-SVD) algorithm. Sparse codes of the feature descriptions are then obtained using an orthogonal matching pursuit (OMP) approach, and these sparse codes are average-pooled to extract encoded feature vectors. The encoded feature vectors from each batch are fused to form the final view-invariant feature representation, and a linear support vector machine classifier is trained for action recognition. Experimental results are given on three versions of a multi-view dataset: MuHAVi-8, MuHAVi-14, and MuHAVi-uncut. The proposed approach shows promising results when tested on a novel camera. Results on deep features indicate that the MB-MHI action representation is more view-invariant than a single MHI.
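The HOG branch of the pipeline (dictionary learning, OMP sparse coding, average pooling over batches, linear SVM) can be illustrated with a minimal sketch. This is not the paper's implementation: it uses synthetic descriptors in place of real HOG features, scikit-learn's `MiniBatchDictionaryLearning` as a stand-in for K-SVD, and arbitrary dimensions (`n_batches`, `n_atoms`, etc.) chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for HOG descriptors: n_samples action videos, each
# yielding n_batches MHI batches with one dim-dimensional descriptor.
n_samples, n_batches, dim, n_atoms = 40, 3, 64, 16
X = rng.normal(size=(n_samples, n_batches, dim))
y = rng.integers(0, 2, size=n_samples)  # two hypothetical action classes

# Learn a dictionary (stand-in for the paper's K-SVD step) and configure
# it to produce sparse codes via orthogonal matching pursuit (OMP).
dico = MiniBatchDictionaryLearning(
    n_components=n_atoms,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=4,
    random_state=0,
)
dico.fit(X.reshape(-1, dim))

# Sparse-code every batch descriptor, then fuse the per-batch codes of
# each action into one view-invariant feature vector by concatenation.
codes = dico.transform(X.reshape(-1, dim))
fused = codes.reshape(n_samples, n_batches * n_atoms)

# Train a linear SVM classifier on the fused encoded feature vectors.
clf = LinearSVC(dual=False).fit(fused, y)
print(fused.shape)  # one fused feature vector per action video
```

With several descriptors per batch, the sparse codes within a batch would first be average-pooled before fusion, as the abstract describes.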
