Video spatiotemporal mapping for human action recognition by convolutional neural network

Amin Zare,Arash Sharifi,Hamid Abrishami Moghaddam

doi:10.1007/s10044-019-00788-1

Abstract

In this paper, a 2D representation of a video clip called video spatiotemporal map (VSTM) is presented. VSTM is a compact representation of a video clip which incorporates its spatial and temporal properties. It is created by vertical concatenation of feature vectors generated from subsequent frames. The feature vector corresponding to each frame is generated by applying wavelet transform to that frame (or its subtraction from the subsequent frame) and computing vertical and horizontal projection of quantized coefficients of some specific wavelet subbands. VSTM enables convolutional neural networks (CNNs) to process a video clip for human action recognition (HAR). The proposed approach benefits from power of CNNs to analyze visual patterns and attempts to overcome some CNN challenges such as variable video length problem and lack of training data that leads to over-fitting. VSTM presents a sequence of frames to CNN without imposing any additional computational cost to the CNN learning algorithm. The experimental results of the proposed method on the KTH, Weizmann, and UCF Sports HAR benchmark datasets have shown the supremacy of the proposed method compared with the state-of-the-art methods that used CNN to solve HAR problem.

Full Text