3D Action Recognition Using Depth-Based Feature and Locality-Constrained Affine Subspace Coding
We propose a 3D action recognition algorithm that uses depth-based Gradient Local Auto-Correlations (GLAC) features and Locality-constrained Affine Subspace Coding (LASC) to improve the discrimination of human actions in spatio-temporal subsequences of 3D depth videos. First, each depth video sequence is divided automatically into a set of subsequences (i.e., multi-scale sub-actions) by the normalized motion energy vector. Next, GLAC features based on Depth Motion Maps (DMMs) are employed to capture the shape information and motion cues of each sub-action. To obtain a more compact and discriminative representation, LASC is then used to encode the features extracted from the depth video. We show that LASC performs better than existing methods such as Locality-constrained Linear Coding (LLC). On all three datasets we obtain competitive results compared to fifteen methods, while using fewer features and less complex models.
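The energy-guided division into multi-scale sub-actions can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the energy, so the per-frame energy here is taken to be the count of pixels whose depth changes by more than a hypothetical threshold `eps`, and the function names `normalized_motion_energy` and `split_by_energy` are made up for this sketch.

```python
import numpy as np

def normalized_motion_energy(frames, eps=10.0):
    """Cumulative motion energy per frame, scaled so the last value is 1.0.

    Assumption: energy added at each step = number of pixels whose depth
    changes by more than `eps` between consecutive frames.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(np.diff(frames, axis=0))            # shape (T-1, H, W)
    per_step = (diffs > eps).sum(axis=(1, 2)).astype(np.float64)
    energy = np.concatenate([[0.0], np.cumsum(per_step)])
    total = energy[-1]
    if total == 0:
        # Static clip: fall back to uniform spacing in time.
        return np.linspace(0.0, 1.0, len(frames))
    return energy / total

def split_by_energy(energy, n_parts):
    """Return (start, end) frame ranges that each hold about 1/n_parts of the energy."""
    cuts = [int(np.searchsorted(energy, k / n_parts, side="left"))
            for k in range(1, n_parts)]
    bounds = [0] + cuts + [len(energy)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_parts)]

# Toy clip: motion happens only at the first and last transitions.
frames = np.zeros((5, 2, 2))
frames[1:4] = 100.0
energy = normalized_motion_energy(frames)
# Several scales (1, 2 and 3 parts) give a multi-scale set of sub-actions.
multi_scale = {n: split_by_energy(energy, n) for n in (1, 2, 3)}
print(multi_scale[2])
```

Cutting at equal energy rather than equal time means that slow and fast executions of the same action are split at comparable points of the movement, which is the motivation the abstract gives for energy-guided segmentation.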
- Conference Article
6
- 10.1109/vcip.2017.8305156
- Dec 1, 2017
This paper addresses the problem of human action recognition in depth sequences. Actions performed at various speeds and sharing sub-actions make recognition challenging. A new feature set consisting of two heterogeneous features is proposed to address this challenge. Specifically, we propose an adaptive normalized action motion energy based on the depth video. Guided by this multi-scale energy vector, the depth sequence and the skeleton pose sequence are each divided into a set of subsequences with multiple scales (i.e., multi-scale sub-actions). In the depth modality, Histogram of Oriented Gradients (HOG) features based on Depth Motion Maps (DMMs) are computed from each depth sub-sequence to capture the shape information and motion cues. In the skeleton modality, pose dynamics are extracted from each pose sub-sequence using skeleton information. To obtain a discriminative and compact representation, a classifier based on the Collaborative Representation (CR) learning scheme is adopted. Experiments on two datasets show the effectiveness of the proposed method.
- Book Chapter
29
- 10.1007/978-3-319-27857-5_55
- Jan 1, 2015
This paper presents a new method for human activity recognition using depth sequences. Each depth sequence is represented by three depth motion maps (DMMs) from three projection views (front, side and top) to capture motion cues. A feature extraction method utilizing spatial and orientational auto-correlations of image local gradients is introduced to extract features from DMMs. The gradient local auto-correlations (GLAC) method employs second order statistics (i.e., auto-correlations) to capture richer information from images than the histogram-based methods (e.g., histogram of oriented gradients) which use first order statistics (i.e., histograms). Based on the extreme learning machine, a fusion framework that incorporates feature-level fusion into decision-level fusion is proposed to effectively combine the GLAC features from DMMs. Experiments on the MSRAction3D and MSRGesture3D datasets demonstrate the effectiveness of the proposed activity recognition algorithm.
- Research Article
27
- 10.1109/tcsvt.2017.2715045
- Oct 1, 2018
- IEEE Transactions on Circuits and Systems for Video Technology
This paper addresses the problem of recognizing human actions from depth videos. We propose a representation for 3D action recognition that combines a depth-based local descriptor with locality-constrained affine subspace coding (LASC). First, each depth video sequence is divided into a set of subsequences (i.e., multi-scale sub-actions) based on the normalized motion energy vector. Next, depth motion map-based gradient local auto-correlation features are employed to capture the shape information and motion cues of each sub-action. To obtain a discriminative and compact representation, we extract the local high-order information of the depth video using LASC. Through experiments, we show that LASC performs better than existing methods such as locality-constrained linear coding. We compared LASC with state-of-the-art methods based on similar principles, using features extracted from a single modality, on four datasets, and with those using multiple features or nonlinear recognition machines. The results on four datasets clearly show the effectiveness of the proposed method.
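For orientation, the baseline named in the abstract, locality-constrained linear coding (LLC), can be sketched in its commonly used approximated form: each descriptor is reconstructed from its k nearest codebook atoms under a sum-to-one constraint. This is a sketch of LLC only, not of LASC, and the function name `llc_code`, the neighbourhood size `k` and the regularizer `beta` are assumptions chosen for illustration.

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """Approximated LLC code of descriptor x over a codebook of shape (M, D)."""
    x = np.asarray(x, dtype=np.float64)
    B = np.asarray(codebook, dtype=np.float64)
    # 1. Locality: keep only the k atoms closest to x.
    dists = np.sum((B - x) ** 2, axis=1)
    nn = np.argsort(dists)[:k]
    # 2. Solve min ||x - w^T B_nn||^2 subject to sum(w) = 1.
    z = B[nn] - x                        # shift atoms so that x is the origin
    C = z @ z.T                          # local covariance, shape (k, k)
    C += np.eye(k) * beta * max(np.trace(C), 1e-12)   # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                         # enforce the sum-to-one constraint
    # 3. Scatter the k weights into a sparse code of length M.
    code = np.zeros(B.shape[0])
    code[nn] = w
    return code

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
code = llc_code([0.2, 0.3], codebook, k=3)
print(code)   # the far-away atom at index 3 receives zero weight
```

LLC codes only the position of a descriptor relative to nearby atoms. The abstract's claim is that LASC, by coding within local affine subspaces and using higher-order information, yields a more discriminative representation.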
- Research Article
3
- 10.1504/ijhpcn.2016.10011433
- Jan 1, 2016
- International Journal of High Performance Computing and Networking
This paper presents an effective approach for recognising human actions from depth video sequences by employing depth motion maps (DMMs) and convolutional neural networks (CNNs). Depth maps are projected onto three orthogonal planes, and frame differences under each view (front/side/top) are then accumulated through an entire depth video sequence to generate a DMM. We build a multi-view convolutional neural network (MV-CNN) architecture containing multiple networks to process the three DMMs (DMMf, DMMs, DMMt). The outputs of the fully connected layer under each view are combined into a feature representation, which is then learned in the last softmax regression layer to predict human actions. Experimental results on the MSR-Action3D and UTD-MHAD datasets indicate that the proposed approach achieves state-of-the-art recognition performance and is appropriate for real-time recognition.
- Research Article
2
- 10.11834/jig.211217
- Jan 1, 2023
- Journal of Image and Graphics
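Objective: In human action recognition research, fusing depth data with skeleton data through multimodal methods can effectively improve the action recognition rate. To address the large data volume and high redundancy of depth image information, we propose an algorithm that reduces redundancy by extracting a sequence of action frames carrying the key temporal information, namely the centroid motion path relaxation algorithm, and we propose a new spatio-temporal feature representation based on the characteristics of the different data modalities. Method: Based on the distance the centroid moves between adjacent frames, the centroid motion path relaxation algorithm computes a similarity coefficient for the active region obtained after image differencing, and then discards frames with high similarity to obtain key temporal information sufficient to express the action. A new spatio-temporal feature representation is built from the variation characteristics of the dynamic part of the image, the coordination of the body parts during motion, and local saliency features. Result: The method is validated on the MSR-Action3D dataset. The average classification rate under cross-validation on the three subsets is 95.7432%, which is 2.4432%, 4.7632%, 0.3432% and 0.2132% higher than the Multi-fused, CovP3DJ, D3D-LSTM (densely connected 3DCNN and long short-term memory) and Joint Subset Selection methods, respectively. In an extended experiment on the complete dataset, the cross-validated classification rate is 93.0403%, which shows good robustness. Conclusion: The experimental results show that the proposed redundancy-removal algorithm improves recognition after reducing redundancy, and that the extracted features have low mutual correlation and complement each other well in combined recognition, which effectively improves classification accuracy.
Objective: Human motion recognition has been developing within computer vision and pattern recognition, with applications such as assisted human-computer interaction, motion analysis, intelligent monitoring, and virtual reality. Conventional behavior recognition mainly uses RGB image sequences captured by an RGB camera, which provide two-dimensional information. To improve the ability to detect short-duration fragments, feature descriptors for RGB image sequences, such as histogram of oriented gradient (HOG), histogram of optical flow (HOF), and a three-dimensional feature pyramid, are employed to characterize human behavior. Because RGB image sequences describe behavior only through two-dimensional information, some researchers have focused on the fact that depth images are insensitive to ambient light. The depth information of the image is combined with RGB image features to describe behavior. Multi-modal methods for human behavior recognition can fuse depth data and skeleton data, which effectively improves the action recognition rate. Depth maps are now widely used in human behavior recognition. However, the handling of depth data needs to be optimized because of the time complexity of feature extraction and the space complexity of feature storage.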
To resolve these problems, we develop an algorithm that reduces the number of depth map frames and the resources they consume. We also propose a new representation of motion features based on the motion information of the centroid. Method: First, a temporal feature vector is extracted from the time-sequence information of the depth map sequence. The centroid motion path relaxation algorithm removes duplicate and redundant depth images, and the spatial-structure feature vector extracted from the skeleton map is concatenated to form the spatio-temporal feature input. Next, spatial features are extracted from a three-channel spatial feature map built by concatenating the original skeleton joint coordinates. Finally, the fused probabilities of the spatio-temporal features and the spatial features are used for classification and recognition. The centroid motion path relaxation algorithm targets redundant information, the time complexity of feature extraction, and the space complexity of feature storage. For the skeleton data, a global motion-direction feature is proposed to reflect the integrity and coordination of limb movements. The extracted features are concatenated to obtain the spatio-temporal feature vector, which is fused with and enhanced by the three-channel spatial feature map built from the original skeleton joint coordinates. The method's effectiveness is verified on the MSR-Action3D dataset. Result: Under experimental setting 1, the method is 0.8260% higher than the depth motion map (DMM)-local binary pattern (LBP) algorithm, 1.0152% higher than DMM-CRC (collaborative representation classifier), 3.4501% higher than the DMM-gradient local auto-correlation (DMM-GLAC) algorithm, 0.6058% higher than the EigenJoint algorithm, and 0.6058% higher than space-time auto-correlation of gradient (STACOG) algorithm is 10.6245% higher. After removing redundancy, the result under experimental setting 1 is 0.1261% higher as well.
Cross-validation under experimental setting 2 shows that the average classification and recognition rate over the three subsets is 95.7432%, which is 2.4432% higher than the multi-fused method, 4.7632% higher than the CovP3DJ method, 0.3432% higher than the D3D-LSTM method, and 0.2132% higher than the joint subset selection method. On the complete dataset, it is 2.0303% higher than the low-latency method, 0.2403% higher than the combination of deep models method, and 2.3403% higher than the complex network coding method. Under experimental setting 2, the average cross-validated classification rate over the three subsets is 95.7432%, and the classification rate on the complete dataset is 93.0403%. Conclusion: The proposed algorithm improves recognition by reducing redundancy, and the extracted features have low mutual correlation, which effectively improves classification accuracy.
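The idea of dropping redundant frames according to centroid motion can be illustrated with a deliberately simplified sketch. The abstract does not give the similarity coefficient, so this stand-in keeps a frame only when the centroid of its foreground (non-zero depth) pixels has moved at least a hypothetical `min_shift` pixels since the last kept frame. The names `centroid` and `prune_frames` are assumptions, and this is not the authors' centroid motion path relaxation algorithm.

```python
import numpy as np

def centroid(frame):
    """Centroid (row, col) of the non-zero pixels, or None for an empty frame."""
    ys, xs = np.nonzero(frame)
    if len(ys) == 0:
        return None
    return np.array([ys.mean(), xs.mean()])

def prune_frames(frames, min_shift=1.0):
    """Indices of frames kept after dropping those whose centroid barely moved."""
    kept = [0]
    last = centroid(frames[0])
    for i in range(1, len(frames)):
        c = centroid(frames[i])
        if c is None:
            continue                      # nothing visible: treat as redundant
        if last is None or np.linalg.norm(c - last) >= min_shift:
            kept.append(i)
            last = c
    return kept

# Toy sequence: a single pixel stays put, then jumps two columns to the right.
frames = np.zeros((4, 5, 5))
frames[0, 1, 1] = frames[1, 1, 1] = 1
frames[2, 1, 3] = frames[3, 1, 3] = 1
print(prune_frames(frames))
```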
- Research Article
46
- 10.1016/j.patrec.2015.07.015
- Jul 31, 2015
- Pattern Recognition Letters
Discriminative human action classification using locality-constrained linear coding
- Conference Article
622
- 10.1145/2393347.2396382
- Oct 29, 2012
In this paper, we propose an effective method to recognize human actions from sequences of depth maps, which provide additional body shape and motion information for action recognition. In our approach, we project depth maps onto three orthogonal planes and accumulate global activities through entire video sequences to generate the Depth Motion Maps (DMM). Histograms of Oriented Gradients (HOG) are then computed from DMM as the representation of an action video. The recognition results on Microsoft Research (MSR) Action3D dataset show that our approach significantly outperforms the state-of-the-art methods, although our representation is much more compact. In addition, we investigate how many frames are required in our framework to recognize actions on the MSR Action3D dataset. We observe that a short sub-sequence of 30-35 frames is sufficient to achieve comparable results to that operating on entire video sequences.
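The projection-and-accumulation step can be sketched as follows. The abstract does not say how the side and top views are encoded, so this sketch assumes binary occupancy maps for those two views and an integer depth range `[0, max_depth)`. The helper names `project_views` and `depth_motion_maps` are hypothetical.

```python
import numpy as np

def project_views(depth, max_depth):
    """Front, side and top projections of one depth frame (0 = background)."""
    h, w = depth.shape
    d = np.asarray(depth).astype(np.int64)
    ys, xs = np.nonzero(d)
    zs = np.clip(d[ys, xs], 0, max_depth - 1)
    front = d.astype(np.float64)          # the depth map itself
    side = np.zeros((h, max_depth))       # rows x depth (assumed binary occupancy)
    side[ys, zs] = 1.0
    top = np.zeros((max_depth, w))        # depth x columns (assumed binary occupancy)
    top[zs, xs] = 1.0
    return front, side, top

def depth_motion_maps(frames, max_depth=256):
    """Accumulate absolute differences of consecutive projections per view."""
    views = [project_views(f, max_depth) for f in frames]
    dmms = []
    for v in range(3):
        stack = np.stack([p[v] for p in views])           # (T, rows, cols)
        dmms.append(np.abs(np.diff(stack, axis=0)).sum(axis=0))
    return dmms                                           # [front, side, top]

# Toy clip: one foreground pixel moves one column right and one unit deeper.
f0 = np.zeros((3, 3), dtype=int); f0[1, 1] = 2
f1 = np.zeros((3, 3), dtype=int); f1[1, 2] = 3
dmm_front, dmm_side, dmm_top = depth_motion_maps([f0, f1], max_depth=4)
print(dmm_front.shape, dmm_side.shape, dmm_top.shape)
```

A descriptor such as HOG would then be computed on each of the three maps and the results concatenated. That step is omitted here.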
- Conference Article
19
- 10.1109/mmsp.2015.7340806
- Oct 1, 2015
This paper presents a novel human action recognition method using depth maps. Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, we divide the entire sequence of depth maps into several sub-actions. The absolute difference between two consecutive projected maps is accumulated through a depth video (several sub-actions) sequence to form a Depth Motion Map (DMM) that describes the dynamic feature of an action. In addition, the difference within the threshold between two consecutive projected maps is calculated through the entire depth video to form a Depth Static Map (DSM) that describes the static feature. Collectively, we call them the Temporal Pyramid of Depth Model (TPDM). Then the Spatial Pyramid Histogram of Oriented Gradients (SPHOG) is computed from the TPDM to represent an action. For classification, we apply a support vector machine (SVM) to the proposed descriptors on the MSR Action3D dataset. Experimental results demonstrate the effectiveness of the proposed method.
- Research Article
326
- 10.1007/s11554-013-0370-1
- Aug 11, 2013
- Journal of Real-Time Image Processing
This paper presents a human action recognition method by using depth motion maps (DMMs). Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between two consecutive projected maps is accumulated through an entire depth video sequence forming a DMM. An l2-regularized collaborative representation classifier with a distance-weighted Tikhonov matrix is then employed for action recognition. The developed method is shown to be computationally efficient allowing it to run in real-time. The recognition results applied to the Microsoft Research Action3D dataset indicate superior performance of our method over the existing methods.
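An l2-regularized collaborative representation classifier with a distance-weighted Tikhonov matrix can be sketched as below, assuming the usual closed form α = (AᵀA + λΓᵀΓ)⁻¹Aᵀy with Γ = diag(‖y − aᵢ‖) and classification by the smallest class-wise reconstruction residual. The function name `crc_predict` and the value of `lam` are assumptions for illustration.

```python
import numpy as np

def crc_predict(A, labels, y, lam=1e-3):
    """Classify y with distance-weighted, l2-regularized collaborative representation.

    A      : (D, N) matrix whose columns are training feature vectors
    labels : length-N class labels of the columns
    y      : length-D test feature vector
    """
    A = np.asarray(A, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    labels = np.asarray(labels)
    # Tikhonov weights: training samples far from y are penalized more.
    gamma = np.linalg.norm(A - y[:, None], axis=0)
    G = A.T @ A + lam * np.diag(gamma ** 2)
    # lstsq tolerates a singular G (e.g. when y equals a training sample).
    alpha = np.linalg.lstsq(G, A.T @ y, rcond=None)[0]
    best, best_res = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        res = np.linalg.norm(y - A[:, mask] @ alpha[mask])   # class-wise residual
        if res < best_res:
            best, best_res = c, res
    return best

A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9]])      # two samples per class, as columns
labels = [0, 0, 1, 1]
print(crc_predict(A, labels, [0.95, 0.05]))
```

Because the coefficients come from a single closed-form solve, classification costs one linear system per test sample, which is consistent with the real-time claim in the abstract.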
- Conference Article
1
- 10.1109/iecon.2018.8591591
- Oct 1, 2018
Depth motion maps (DMM), containing abundant information on appearance and motion, are captured from the absolute difference between two consecutive depth video sequences. In this paper, each depth frame is first projected onto three orthogonal planes (front, side, top). Then DMMf, DMMs and DMMt are generated under the three projection views, respectively. To describe the DMMs both locally and globally, histogram of oriented gradients (HOG), local binary patterns (LBP), and a local Gist feature description based on a dense grid are computed. Considering the advantages of feature fusion and of quantitative evaluation by information entropy, the three descriptors are weighted and fused using an information-entropy-improved Principal Component Analysis (PCA) to represent the depth video. A collaborative classifier that adaptively weights and combines $l_1$-norm and $l_2$-norm reconstruction errors is employed for action recognition; the adaptive weights are determined by the entropy method. Experimental results on the MSR Action3D dataset show that the present approach has strong robustness, discriminability and stability.
- Research Article
52
- 10.4018/ijmdem.2015100102
- Oct 1, 2015
- International Journal of Multimedia Data Engineering and Management
The emerging cost-effective depth sensors have facilitated the action recognition task significantly. In this paper, the authors address the action recognition problem using depth video sequences combining three discriminative features. More specifically, the authors generate three Depth Motion Maps (DMMs) over the entire video sequence corresponding to the front, side, and top projection views. Contourlet-based Histogram of Oriented Gradients (CT-HOG), Local Binary Patterns (LBP), and Edge Oriented Histograms (EOH) are then computed from the DMMs. To merge these features, the authors consider decision-level fusion, where a soft decision-fusion rule, Logarithmic Opinion Pool (LOGP), is used to combine the classification outcomes from multiple classifiers each with an individual set of features. Experimental results on two datasets reveal that the fusion scheme achieves superior action recognition performance over the situations when using each feature individually.
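The Logarithmic Opinion Pool (LOGP) rule combines per-classifier posteriors as a weighted product, which is equivalent to a weighted sum of log-probabilities. The sketch below assumes that each classifier outputs a posterior over the same classes and that the weights are uniform unless given. The function name `logp_fuse` and the toy numbers are illustrative only.

```python
import numpy as np

def logp_fuse(posteriors, weights=None, eps=1e-12):
    """Fuse classifier posteriors of shape (Q, C) with a logarithmic opinion pool."""
    P = np.asarray(posteriors, dtype=np.float64)
    q = P.shape[0]
    w = np.full(q, 1.0 / q) if weights is None else np.asarray(weights, dtype=np.float64)
    # Weighted sum of log-posteriors == log of the weighted product.
    log_p = (w[:, None] * np.log(P + eps)).sum(axis=0)
    log_p -= log_p.max()                  # stabilize before exponentiating
    fused = np.exp(log_p)
    return fused / fused.sum()            # renormalize to a distribution

# Three classifiers (e.g. one per feature type), two classes.
posteriors = [[0.6, 0.4],
              [0.7, 0.3],
              [0.2, 0.8]]
fused = logp_fuse(posteriors)
print(fused, int(np.argmax(fused)))
```

In this toy case a majority vote would pick class 0, while the product rule picks class 1 because the third classifier is far more confident. This is the kind of soft behaviour that distinguishes decision-level fusion with LOGP from hard voting.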
- Research Article
16
- 10.1007/s12652-018-1136-1
- Nov 17, 2018
- Journal of Ambient Intelligence and Humanized Computing
Camera-based action recognition plays a key role in diverse computer vision applications such as human-computer interaction. This paper proposes a new action recognition approach using motion descriptors based on multi-directional projected depth motion maps. First, all the depth frames in the input depth video sequence are projected onto multiple planes to form the projected images. The absolute difference between two consecutive projected images is accumulated through the entire depth video to establish maps from multiple views. Then, the local motion consistency of each map is examined to form a histogram of local binary patterns; the histograms are concatenated and fed into a kernel-based extreme learning machine for action recognition. In contrast to conventional approaches, which use only three directions to calculate the projected depth images for motion feature extraction, the proposed approach provides an effective and flexible framework for examining the depth motion maps in multiple projected directions. The proposed approach is evaluated on the well-known MSRA action and gesture video benchmark datasets to demonstrate its superior performance.
- Research Article
32
- 10.1007/s11042-019-7365-2
- Mar 13, 2019
- Multimedia Tools and Applications
In this paper, we present an approach for identifying actions in depth action videos. First, we process the video to obtain motion history images (MHIs) and static history images (SHIs) of an action video using the 3D Motion Trail Model (3DMTM). We then characterize the action video by extracting Gradient Local Auto-Correlations (GLAC) features from the SHIs and the MHIs. The two sets of features, i.e., GLAC features from MHIs and GLAC features from SHIs, are concatenated to obtain a representation vector for the action. Finally, we classify all the action samples using the l2-regularized Collaborative Representation Classifier (l2-CRC) to recognize different human actions effectively. We evaluate the proposed method on three action datasets: MSR-Action3D, DHA and UTD-MHAD. The experimental results show that the proposed method outperforms other approaches.
- Research Article
22
- 10.1007/s11042-016-3988-8
- Sep 28, 2016
- Multimedia Tools and Applications
Hand gesture recognition has many practical applications, including human-computer interfaces. Many depth-based features have been proposed for dynamic hand gesture recognition. However, performance is still unsatisfactory because these features cannot efficiently capture both effective shape information and detailed variation of hands in the spatial and temporal domains. In this paper, we propose a new effective descriptor, DLEH2, for depth-based dynamic hand gesture recognition, which is developed from the characteristics of dynamic hand gestures by fusing simple shape and spatio-temporal features of depth sequences. For shape information, depth motion maps (DMMs) are first employed to obtain the 3D structure and shape information of hands. To enhance critical shape cues, the local texture and edge information of the three DMMs for hand gesture sequences are captured using the DLE descriptor. However, DMMs compress the temporal information of the depth sequences into the spatial domain, which loses some of the discrimination that is critical for temporal sequence recognition. Simple but effective spatio-temporal features, HOG2, are concatenated with DLE to compensate for the temporal information lost during DMM generation and to capture the detailed spatial and temporal variation of hands. Experimental results on two public benchmark datasets, 99.10 % for the MSRGesture3D dataset and 98.43 % for the SKIG dataset, show that the proposed fusion scheme outperforms the state-of-the-art methods.
- Dissertation
1
- 10.32657/10356/69601
- Jan 1, 2017
Saliency estimation aims to identify visually important regions in an image and to inhibit distractors. It has been used in recent object detectors and image classifiers as a pre-processor to indicate possible object regions in an image. The category-independent object proposals produced by bottom-up saliency approaches include those that are irrelevant for tasks like object detection. The precision of the object proposals can be improved through top-down saliency approaches that produce category-specific saliency maps. Although the prior knowledge about object categories learnt by classifiers is useful for top-down saliency estimation, the relationship between image classifiers and top-down salient object detectors has not been explored substantially. In this thesis we develop classifier-based approaches for top-down salient object detection, of which the first two are trained in a fully supervised setting and the last two are trained in a weakly supervised setting. Non-linear feature representations such as sparse coding (SC) or locality-constrained linear coding (LLC) cascaded with linear classifiers have proven effective in image classification. They are also used for top-down salient object detection to achieve a compact and discriminative representation of SIFT features, which helps to model feature selectivity for the saliency map. We analyze the influence of these feature coding approaches in top-down salient object detection and also propose a novel coding strategy for top-down saliency estimation. The proposed coding strategy ensures that similar codes are assigned to features that are adjacent in the spatial, feature and category domains. These locality-constrained contextual sparse codes are max-pooled over a spatial neighborhood, and a logistic regression classifier learnt on these max-pooled vectors is used for saliency estimation. Many practical computer vision systems need to simultaneously identify the presence of an object and segment it.
Moreover, image classifiers and top-down salient object detectors often share similar modules, such as the feature extractor, feature coding and feature classifier. This motivated us to develop our second fully supervised top-down saliency approach, which is a joint framework for saliency estimation and image classification. In this framework, the image classifier is used both to quantify the likelihood of the presence of an object and to update the saliency map using a novel saliency refinement method. A novel saliency-weighted max-pooling is proposed to improve image classification by weighting the max-pooled vector in each block of the spatial pyramid with a weight computed using top-down saliency maps. Conventional top-down saliency approaches require fully supervised training in which exact object annotation is required. The availability of images from a simple tag-based internet search has made exact annotation for training saliency models unnecessary. This motivated us to develop weakly supervised top-down saliency approaches that are trained with image-level labels indicating the presence or absence of an object of interest. First, the probabilistic contribution of each patch in the image to the confidence score of a sparse coded spatial pyramid max-pooling (ScSPM) image classifier is analyzed to estimate its Reverse-ScSPM (R-ScSPM) saliency. For high-level understanding of the surrounding spatial region, contextual information of the patch is required, which is incorporated using a contextual saliency module. Besides illustrating the accuracy of saliency maps produced by the proposed method, we demonstrate its effectiveness in applications like weakly supervised object annotation, class segmentation and action classification. Finally, we develop a convolutional neural network (CNN) based, weakly supervised salient object detection approach that has both bottom-up and top-down modules.
Here, we modify the backtracking strategy to identify salient regions that make a positive contribution to a CNN-based image classifier. From a set of saliency maps of an image produced by fast bottom-up saliency approaches, we propose a novel strategy to select the best saliency map for the top-down task. The selected bottom-up saliency map is combined with the top-down saliency map. Features having high combined saliency are used to train a linear SVM classifier to estimate contextual saliency. This is integrated with the combined saliency and further refined through multi-scale superpixel-averaging of the saliency map. Experiments are carried out on seven challenging datasets, and quantitative results are compared with 36 closely related approaches across 4 different applications.