Multimodal spatial-temporal feature representation and its application in action recognition

Shi Haiyong, Chao Xin, Zhong Zhuokun, Hou Zhenjie

doi:10.11834/jig.211217

Abstract

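Objective In human action recognition, multimodal methods that fuse depth data with skeleton data can effectively improve the recognition rate. To address the large data volume and high redundancy of depth image information, we propose an algorithm that reduces redundancy by extracting the sequence of action frames carrying the key temporal information, namely the centroid motion path relaxation algorithm, and, based on the characteristics of the different modalities, a new spatio-temporal feature representation. Method The centroid motion path relaxation algorithm uses the distance the centroid moves between adjacent frames to compute a similarity coefficient for the active parts obtained by image differencing, and then discards frames with high similarity, leaving the key temporal information sufficient to express the action. A new spatio-temporal feature representation is built from the change characteristics of the dynamic part of the image, the coordination of the body parts during motion, and local saliency features. Result The method is validated on the MSR-Action3D dataset. The average classification recognition rate under cross-validation on the three subsets is 95.7432%, which is 2.4432%, 4.7632%, 0.3432%, and 0.2132% higher than the Multi-fused, CovP3DJ, D3D-LSTM (densely connected 3DCNN and long short-term memory), and Joint Subset Selection methods, respectively. In an extended experiment using the complete dataset, the cross-validated classification recognition rate is 93.0403%, indicating good robustness. Conclusion The experimental results show that the proposed de-redundancy algorithm improves recognition after reducing redundancy, and that the extracted features have low mutual correlation and complement each other well in combined recognition, which effectively improves classification accuracy.

Objective Human action recognition is an active topic in computer vision and pattern recognition, with applications such as assisted human-computer interaction, motion analysis, intelligent monitoring, and virtual reality. Conventional action recognition relies mainly on RGB image sequences captured by an RGB camera, which provide only two-dimensional information. To improve the detection of short-duration fragments, feature descriptors for RGB image sequences, such as the histogram of oriented gradients (HOG), the histogram of optical flow (HOF), and three-dimensional feature pyramids, are used to characterize human behavior. Because RGB image sequences describe the acting subject only in two dimensions, some researchers exploit the insensitivity of depth images to ambient light and combine depth information with RGB features to describe behavior. In human action recognition, multimodal methods that fuse depth data and skeleton data can effectively improve the recognition rate. Depth maps are now widely used in human action recognition, but the collected depth data need to be optimized because of the time complexity of feature extraction and the space complexity of feature storage.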
To resolve these problems, we develop an algorithm that reduces the number of depth map frames and the resources they consume. We also propose a new representation of motion features based on the motion information of the centroid. Method First, a temporal feature vector is built from the temporal information extracted from the depth map sequence, where the centroid motion path relaxation algorithm removes duplicate and redundant depth images; this vector is concatenated with the spatial structure feature vector extracted from the skeleton map to form the spatio-temporal feature input. Next, spatial features are extracted from a three-channel spatial feature map assembled from the original coordinates of the skeleton points. Finally, the fused probabilities of the spatio-temporal features and the spatial features are used for classification and recognition. The centroid motion path relaxation algorithm targets redundant information, the time complexity of feature extraction, and the space complexity of feature storage. For the skeleton data, a global motion direction feature is proposed to reflect the integrity and coordination of limb movements. The extracted features are concatenated to obtain the spatio-temporal feature vector, which is fused with and enhanced by the three-channel spatial feature map built from the original skeleton point coordinates. The effectiveness of the method is verified on the MSR-Action3D dataset. Result In experimental setting 1, the proposed method is 0.8260% higher than the depth motion map (DMM)-local binary pattern (LBP) algorithm, 1.0152% higher than the DMM-CRC (collaborative representation classifier) algorithm, 3.4501% higher than the DMM-GLAC (gradient local auto-correlation) algorithm, 0.6058% higher than the EigenJoint algorithm, and 0.6058% higher than the space-time auto-correlation of gradients (STACOG) algorithm (the source also gives 10.6245% for STACOG). Removing redundancy raises the result in experimental setting 1 by a further 0.1261%.
Cross-validation in experimental setting 2 shows that the average classification recognition rate over the three subsets is 95.7432%, which is 2.4432% higher than the multi-fused method, 4.7632% higher than the CovP3DJ method, 0.3432% higher than the D3D-LSTM method, and 0.2132% higher than the joint subset selection method. On the complete dataset, the classification recognition rate is 93.0403%, which is 2.0303% higher than the low latency method, 0.2403% higher than the combination of deep models method, and 2.3403% higher than the complex network coding method. Conclusion The proposed algorithm improves recognition by removing redundancy, and the extracted features have low mutual correlation, which effectively improves the accuracy of classification recognition.
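
The frame-selection idea described in the Method part can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formula for the similarity coefficient, so the sketch assumes a simple stand-in rule in which a frame is kept only if the centroid of its active (changed) pixels lies at least `min_step` pixels from the centroid of the last kept frame. The function names and the parameters `diff_threshold` and `min_step` are hypothetical.

```python
import numpy as np


def active_centroids(frames, diff_threshold=10.0):
    """Centroid (row, col) of the pixels that changed between each frame and
    its predecessor. Entry 0 and motionless frames are None."""
    centroids = [None]
    for prev, cur in zip(frames[:-1], frames[1:]):
        # Cast to float so unsigned depth values do not wrap on subtraction.
        diff = np.abs(cur.astype(np.float64) - prev.astype(np.float64))
        active = diff > diff_threshold
        if active.any():
            rows, cols = np.nonzero(active)
            centroids.append(np.array([rows.mean(), cols.mean()]))
        else:
            centroids.append(None)
    return centroids


def select_key_frames(frames, min_step=2.0, diff_threshold=10.0):
    """Return indices of frames kept after dropping redundant ones.

    Assumed rule (a stand-in for the paper's similarity coefficient):
    a frame is redundant if nothing changed relative to its predecessor,
    or if its active-region centroid moved less than min_step pixels
    from the centroid of the last kept frame.
    """
    if len(frames) == 0:
        return []
    centroids = active_centroids(frames, diff_threshold)
    kept = [0]      # the first frame is always kept as a reference
    anchor = None   # active-region centroid of the last kept frame
    for i in range(1, len(frames)):
        c = centroids[i]
        if c is None:
            continue  # no motion at all: treat as a duplicate
        if anchor is None or np.linalg.norm(c - anchor) >= min_step:
            kept.append(i)
            anchor = c
    return kept
```

Calling `select_key_frames(depth_frames)` on a list of equally sized 2D depth arrays returns the indices of the retained frames; raising `min_step` discards more frames and lowering it retains more.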

Full Text