Real-time Action Recognition by Spatiotemporal Semantic and Structural Forests

Tsz-Ho Yu, Roberto Cipolla, Tae-Kyun Kim

doi:10.5244/c.24.52

Abstract

This paper presents a novel real-time action recogniser that utilises both local appearance and structural information. Our method recognises actions continuously in real time while achieving accuracy comparable to the state of the art. Run-time speed is of vital importance in real-world action recognition systems, but existing methods seldom take computational complexity fully into consideration. In such methods, a class label is assigned after an entire query video is analysed, or a large lookahead is required to recognise an action. In addition, the “bag of words” (BOW) model has proven effective for action recognition [5]. However, the standard BOW model ignores the spatiotemporal relationships among feature descriptors, which are useful for describing actions. Addressing these challenges, we present a novel approach for action recognition. The major contributions are as follows.

Efficient Spatiotemporal Codebook Learning: We extend the use of semantic texton forests (STFs) [6] from 2D image segmentation to spatiotemporal analysis. As well as being much faster than a traditional flat codebook such as k-means clustering, STFs achieve accuracy comparable to that of existing approaches. STFs are ensembles of random decision trees that textonise input video patches into semantic textons. Since only a small number of simple features are used to traverse the trees, STFs are extremely fast to evaluate. The multiple decision trees also serve as a powerful discriminative codebook. Figure 1 illustrates how visual codewords are generated using STFs in the proposed method.

Combined Structural and Appearance Information: We propose a richer description of features, so that actions can be classified in very short video sequences. Based on [3], we introduce the pyramidal spatiotemporal relationship match (PSRM) to encapsulate both local appearance and structural information efficiently. Subsequences are sampled from an input video in short intervals (e.g. ≤ 10 frames).
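The textonisation step described for STFs, in which a video patch descends each tree through a few simple feature tests and the leaf it reaches acts as its codeword, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the names `SplitNode`, `make_random_tree`, `textonise` and `textonise_forest` are hypothetical, the split test (difference of two voxels against a threshold) is only one plausible choice of simple feature, and the trees are filled with random, untrained splits instead of being learned from labelled video.

```python
import random
from dataclasses import dataclass


@dataclass
class SplitNode:
    """One internal node: compares two voxels of a flattened patch (assumed test)."""
    a: int            # index of the first voxel in the flattened patch
    b: int            # index of the second voxel
    threshold: float  # go right when patch[a] - patch[b] exceeds this value


def make_random_tree(depth, patch_size, rng):
    """Build a complete binary tree of random (untrained) splits.

    Internal nodes are stored in heap order: the children of node i
    are nodes 2*i + 1 and 2*i + 2.
    """
    n_internal = 2 ** depth - 1
    return [
        SplitNode(rng.randrange(patch_size), rng.randrange(patch_size),
                  rng.uniform(-32.0, 32.0))
        for _ in range(n_internal)
    ]


def textonise(tree, patch):
    """Return the index of the leaf reached by `patch`, in [0, 2**depth)."""
    n_internal = len(tree)
    node = 0
    while node < n_internal:
        split = tree[node]
        go_right = (patch[split.a] - patch[split.b]) > split.threshold
        node = 2 * node + 1 + int(go_right)
    return node - n_internal


def textonise_forest(forest, patch):
    """One codeword (leaf index) per tree in the forest."""
    return [textonise(tree, patch) for tree in forest]


# Usage: a 3x3x3 spatiotemporal patch, flattened to 27 intensity values.
rng = random.Random(0)
forest = [make_random_tree(depth=3, patch_size=27, rng=rng) for _ in range(4)]
patch = [rng.randrange(256) for _ in range(27)]
print(textonise_forest(forest, patch))
```

Each lookup costs only `depth` comparisons per tree, which is the property that makes a tree-based codebook cheaper to evaluate than a nearest-centre search over a flat k-means codebook.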
After spatiotemporal interest points are localised, the trained STFs assign visual codewords to the features. A set of pairwise spatiotemporal associations is designed to capture the structural relationships among features (i.e. pairwise distances along the space-time axes). All possible pairs in the bag of features are analysed by the association rules and stored in a 3-D histogram. PSRM leverages the properties of semantic trees and pyramidal match kernels. Multiple pyramidal histograms are then combined to classify a query video. Figure 2 illustrates how the relationship histograms are constructed and matched using PSRM. For each tree in the STFs, a three-dimensional histogram is constructed according to the spatiotemporal structure of the features (see figure 2 (left)). The hierarchical structure of the tree offers a time-efficient way to perform the pyramid match kernel [1] for codeword matching (figure 2 (right)).

Enhanced Efficiency and Combined Classification: Several techniques are employed to improve the recognition speed and accuracy. A novel spatiotemporal interest point detector, called V-FAST, is designed based on the FAST 2D corners [2]. The recognition accuracy is enhanced by adaptively combining PSRM and the bag of semantic textons (BOST) method [6]: the k-means forest classifier is learned using PSRM as a matching kernel. The task of action recognition is performed separately

[Figure 2 labels: Spatiotemporal Relationship Match of visual codewords from Semantic Texton Forest; Pyramid Match Kernel is utilised to match the histograms; Feature Extraction; Feature Matching]
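The pairwise relationship histogram and its pyramid matching can be sketched as follows. This is a simplified illustration under stated assumptions, not the PSRM formulation itself: the association rule in `relation_bin`, its thresholds, the histogram key layout `(codeword_a, codeword_b, relation)`, and the level weights `1 / 2**level` (the generic pyramid match form) are all hypothetical choices, and codewords are assumed to be leaf indices of a complete binary tree so that shifting an index right by one bit merges sibling leaves.

```python
from collections import Counter
from itertools import combinations


def relation_bin(f1, f2, space_thresh=10.0, time_thresh=5):
    """Hypothetical association rule: quantise a pair by its space and time gaps."""
    (x1, y1, t1, _), (x2, y2, t2, _) = f1, f2
    near_in_space = abs(x1 - x2) <= space_thresh and abs(y1 - y2) <= space_thresh
    near_in_time = abs(t1 - t2) <= time_thresh
    return 2 * int(near_in_space) + int(near_in_time)  # one of four bins


def relationship_histogram(features, level=0):
    """Sparse 3-D histogram keyed by (codeword_a, codeword_b, relation).

    Each feature is (x, y, t, codeword). `level` coarsens the codewords by
    moving `level` steps up the tree (leaf index >> level).
    """
    hist = Counter()
    for f1, f2 in combinations(features, 2):
        a, b = sorted((f1[3] >> level, f2[3] >> level))
        hist[(a, b, relation_bin(f1, f2))] += 1
    return hist


def intersection(h1, h2):
    """Histogram intersection: number of pairs the two histograms share."""
    return sum(min(count, h2[key]) for key, count in h1.items())


def pyramid_match(feats_x, feats_y, depth):
    """Weight matches found at coarser tree levels less than finer ones."""
    score, previous = 0.0, 0
    for level in range(depth + 1):
        matches = intersection(relationship_histogram(feats_x, level),
                               relationship_histogram(feats_y, level))
        score += (matches - previous) / (2 ** level)  # only newly found matches
        previous = matches
    return score


# Usage: two short subsequences whose codewords differ only by sibling leaves.
clip_a = [(0, 0, 0, 0), (1, 1, 1, 2)]
clip_b = [(0, 0, 0, 1), (1, 1, 1, 3)]
print(pyramid_match(clip_a, clip_a, depth=2), pyramid_match(clip_a, clip_b, depth=2))
```

Reusing the tree hierarchy for the coarser levels means no separate multi-resolution codebook has to be built, which is the efficiency argument made for matching along the tree structure.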
