Multimodal spatial-temporal feature representation and its application in action recognition
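Objective In human action recognition research, fusing depth data with skeleton data through multimodal methods can effectively improve the recognition rate of actions. To address the large data volume and high redundancy of depth images, an algorithm is proposed that reduces redundancy by extracting a sequence of action frames carrying the key temporal information, namely the centroid motion path relaxation algorithm. A new spatio-temporal feature representation is also proposed according to the characteristics of the different modalities. Method Based on the distance the centroid moves between adjacent frames, the centroid motion path relaxation algorithm computes a similarity coefficient for the active regions obtained after image differencing, then discards frames with high similarity to obtain key temporal information sufficient to express the action. A new spatio-temporal feature representation is constructed from the variation of the dynamic part of the image, the coordination among body parts during motion, and local saliency features. Result The method is validated on the MSR-Action3D dataset. The average classification recognition rate under cross-validation on the three subsets is 95.7432%, which is 2.4432%, 4.7632%, 0.3432%, and 0.2132% higher than the Multi-fused, CovP3DJ, D3D-LSTM (densely connected 3DCNN and long short-term memory), and Joint Subset Selection methods, respectively. In an extended experiment using the complete dataset, the cross-validated classification recognition rate is 93.0403%, which shows good robustness. Conclusion The experimental results show that the proposed redundancy-removal algorithm improves recognition after reducing redundancy, and that the extracted features have low mutual correlation and complement one another well in combined recognition, which effectively improves classification accuracy.
Objective Human action recognition has been developing within computer vision and pattern recognition, with applications such as assisted human-computer interaction, motion analysis, intelligent monitoring, and virtual reality. Conventional action recognition mainly uses RGB image sequences captured by an RGB camera, which provide only two-dimensional information. To improve the detection of short-duration fragments, feature descriptors for RGB image sequences, such as the histogram of oriented gradients (HOG), the histogram of optical flow (HOF), and a three-dimensional feature pyramid, are used to characterize human behavior. Because RGB image sequences describe behavior only in two dimensions, and because depth images are insensitive to ambient light, some researchers combine the depth information of the image with RGB features to describe behavior. Multimodal methods for human action recognition can fuse depth data with skeleton data, which effectively improves the recognition rate of actions. Depth maps are now widely used in human action recognition. However, the collection of depth data needs to be optimized because of the time complexity of feature extraction and the space complexity of feature storage.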
To resolve these problems, we develop an algorithm that reduces the number of depth-map frames and the associated resource consumption. We also propose a new representation of motion features based on the motion information of the centroid. Method First, a temporal feature vector is extracted from the time-sequence information of the depth map sequence. The centroid motion path relaxation algorithm removes duplicate and redundant depth images, and the temporal features are concatenated with the spatial structure feature vector extracted from the skeleton map to form the spatio-temporal feature input. Next, spatial features are extracted from a three-channel spatial feature map built from the original skeleton joint coordinates. Finally, the fused probabilities of the spatio-temporal features and the spatial features are used for classification and recognition. The centroid motion path relaxation algorithm targets redundant information, the time complexity of feature extraction, and the space complexity of feature storage. For the skeleton data, a global motion-direction feature is proposed to reflect the integrity and coordination of limb movements. The extracted features are concatenated to obtain the spatio-temporal feature vector, which is fused with and enhanced by the three-channel spatial feature map built from the original skeleton joint coordinates. The effectiveness of the method is verified on the MSR-Action3D dataset. Result Under experimental setting 1, the recognition rate is 0.8260% higher than the depth motion map (DMM)-local binary pattern (LBP) algorithm, 1.0152% higher than DMM-CRC (collaborative representation classifier), 3.4501% higher than the DMM-gradient local auto-correlation (DMM-GLAC) algorithm, 0.6058% higher than the EigenJoint algorithm, and 10.6245% higher than the space-time auto-correlation of gradients (STACOG) algorithm. After removing redundancy, the result under experimental setting 1 is also 0.1261% higher.
Cross-validation under experimental setting 2 shows that the average classification recognition rate over the three subsets is 95.7432%, which is 2.4432% higher than the multi-fused method, 4.7632% higher than the CovP3DJ method, 0.3432% higher than the D3D-LSTM method, and 0.2132% higher than the joint subset selection method. On the complete dataset, the classification recognition rate is 93.0403%, which is 2.0303% higher than the low latency method, 0.2403% higher than the combination of deep models method, and 2.3403% higher than the complex network coding method. Conclusion The proposed redundancy-removal algorithm improves recognition, and the extracted features have low mutual correlation, which effectively improves the accuracy of classification recognition.
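The frame-pruning idea in the abstract (difference adjacent frames, locate the centroid of the active region, drop frames whose centroid has barely moved) can be sketched roughly as follows. This is a minimal illustration and not the authors' centroid motion path relaxation algorithm: the function names, the use of plain centroid displacement in place of the paper's similarity coefficient, and both thresholds are assumptions made for the example.

```python
from math import hypot

def active_mask(prev, curr, diff_thresh=10):
    """Binary mask of pixels whose depth changed by more than diff_thresh."""
    return [[1 if abs(c - p) > diff_thresh else 0 for p, c in zip(pr, cr)]
            for pr, cr in zip(prev, curr)]

def centroid(mask):
    """Centroid (row, col) of the active pixels, or None if nothing moved."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return None
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)

def select_key_frames(frames, min_shift=1.0, diff_thresh=10):
    """Return indices of frames kept as key frames (hypothetical rule).

    A frame is kept when the centroid of its active region lies at least
    min_shift pixels from the centroid recorded for the last kept frame.
    """
    kept = [0]      # the first frame is always kept
    last = None     # centroid of the active region of the last kept frame
    for i in range(1, len(frames)):
        c = centroid(active_mask(frames[i - 1], frames[i], diff_thresh))
        if c is None:
            continue  # no change against the previous frame: redundant
        if last is None or hypot(c[0] - last[0], c[1] - last[1]) >= min_shift:
            kept.append(i)
            last = c
    return kept
```

Raising `min_shift` prunes more aggressively, which trades temporal detail for fewer frames to process and store.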
- Conference Article
43
- 10.1109/wacv.2013.6475019
- Jan 1, 2013
This paper presents a wildfire smoke detection method based on a spatiotemporal bag-of-features (BoF) and a random forest classifier. First, candidate blocks are detected using key-frame differences and non-parametric color models to reduce the computation time. Subsequently, spatiotemporal three-dimensional (3D) volumes are built by combining the candidate blocks in the current key-frame and the corresponding blocks in previous frames. A histogram of gradient (HOG) is extracted as a spatial feature, and a histogram of optical flow (HOF) is extracted as a temporal feature based on the fact that the diffusion direction of smoke is upward owing to thermal convection. Using these spatiotemporal features, a codebook and a BoF histogram are generated from training data. For smoke verification, a random forest classifier is built during the training phase by using the BoF histogram. The random forest with BoF histogram can increase the detection accuracy and allow smoke detection to be carried out in near real-time.
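The bag-of-features step mentioned above can be illustrated with a short sketch: each local descriptor (for example a concatenated HOG/HOF vector) is assigned to its nearest codeword, and the normalised assignment counts form the histogram passed to the classifier. The function name and the squared Euclidean distance are assumptions for illustration; the abstract does not say how the codebook is built or which distance is used.

```python
def bof_histogram(descriptors, codebook):
    """Normalised histogram of nearest-codeword assignments.

    descriptors: iterable of equal-length numeric vectors
    codebook:    list of codeword vectors (assumed to be learned beforehand)
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance from this descriptor to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(d, w)) for w in codebook]
        counts[dists.index(min(dists))] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

Normalising by the descriptor count keeps histograms comparable between volumes that yield different numbers of descriptors.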
- Research Article
13
- 10.1038/s41598-023-45211-2
- Nov 22, 2023
- Scientific Reports
Behavior is one of the important factors reflecting the health status of dairy cows, and when dairy cows encounter health problems, they exhibit different behavioral characteristics. Therefore, identifying dairy cow behavior not only helps in assessing their physiological health and disease treatment but also improves cow welfare, which is very important for the development of animal husbandry. The method of relying on human eyes to observe the behavior of dairy cows has problems such as high labor costs, high labor intensity, and high fatigue rates. Therefore, it is necessary to explore more effective technical means to identify cow behaviors more quickly and accurately and improve the intelligence level of dairy cow farming. Automatic recognition of dairy cow behavior has become a key technology for diagnosing dairy cow diseases, improving farm economic benefits and reducing animal elimination rates. Recently, deep learning for automated dairy cow behavior identification has become a research focus. However, in complex farming environments, dairy cow behaviors are characterized by multiscale features due to large scenes and long data collection distances. Traditional behavior recognition models cannot accurately recognize similar behavior features of dairy cows, such as those with similar visual characteristics, i.e., standing and walking. The behavior recognition method based on 3D convolution solves the problem of small visual feature differences in behavior recognition. However, due to the large number of model parameters, long inference time, and simple data background, it cannot meet the demand for real-time recognition of dairy cow behaviors in complex breeding environments. To address this, we developed an effective yet lightweight model for fast and accurate dairy cow behavior feature learning from video data. We focused on four common behaviors: standing, walking, lying, and mounting. 
We recorded videos of dairy cow behaviors at a dairy farm containing over one hundred cows using surveillance cameras. A robust model was built using a complex background dataset. We proposed a two-pathway X3DFast model based on spatiotemporal behavior features. The X3D and fast pathways were laterally connected to integrate spatial and temporal features. The X3D pathway extracted spatial features. The fast pathway with R(2 + 1)D convolution decomposed spatiotemporal features and transferred effective spatial features to the X3D pathway. An action model further enhanced X3D spatial modeling. Experiments showed that X3DFast achieved 98.49% top-1 accuracy, outperforming similar methods in identifying the four behaviors. The method we proposed can effectively identify similar dairy cow behaviors while improving inference speed, providing technical support for subsequent dairy cow behavior recognition and daily behavior statistics.
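To see why factorising a 3D convolution can reduce model size, the weight counts of a full t×d×d convolution and of an R(2+1)D-style pair (a 1×d×d spatial convolution followed by a t×1×1 temporal convolution) can be compared. The sketch below is only arithmetic; the channel widths, the intermediate width `mid`, and the omission of bias terms are assumptions for illustration and are not taken from the X3DFast paper.

```python
def conv3d_params(n_in, n_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return n_in * n_out * t * d * d

def conv2plus1d_params(n_in, n_out, t, d, mid):
    """Weights in a (2+1)D factorisation with `mid` intermediate channels."""
    spatial = n_in * mid * d * d   # 1 x d x d spatial convolution
    temporal = mid * n_out * t     # t x 1 x 1 temporal convolution
    return spatial + temporal

# Hypothetical layer: 64 -> 64 channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, 3, 3)
factored = conv2plus1d_params(64, 64, 3, 3, mid=64)
```

With these illustrative widths the factorised layer has fewer than half the weights of the full 3D layer; a larger `mid` narrows that gap.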
- Research Article
5
- 10.1093/comjnl/bxad130
- Dec 29, 2023
- The Computer Journal
This paper proposes a novel video hashing with tensor robust Principal Component Analysis (PCA) and Histogram of Optical Flow (HOF) for copy detection. In the proposed hashing, a video is divided into some video groups. For each video group, a low-rank secondary frame is constructed from the low-rank component decomposed by applying tensor robust PCA to the video group. Since the low-rank component can well indicate spatial-temporal intrinsic structure of the video group and it is slightly disturbed by digital operations, feature extraction from the low-rank secondary frames is discriminative and stable. Next, spatial features and temporal features are extracted from low-rank secondary frames by Charlier moments and HOF, respectively. Since the Charlier moments are robust to geometric transform and they can efficiently distinguish video frames with different contents, the use of Charlier moments can make robust and discriminative spatial features. As the HOF can measure the distribution of motion information between frames, the temporal features formed by HOFs can provide good discrimination. Hash is ultimately determined by quantizing the spatial and temporal features and concatenating the quantized results. Numerous experiments on open video datasets indicate that the proposed hashing is superior to some hashing baseline schemes in terms of classification and copy detection.
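The final step described above, quantising the spatial and temporal features and concatenating the results, can be sketched as follows. The abstract does not specify the quantiser, so thresholding each feature vector at its own median is an assumption, as are the function names; the Hamming distance is shown only as one common way to compare binary hashes.

```python
from statistics import median

def quantize(features):
    """One bit per feature: 1 if the value exceeds the vector's median."""
    m = median(features)
    return [1 if f > m else 0 for f in features]

def video_hash(spatial, temporal):
    """Concatenate the quantised spatial and temporal feature bits."""
    return quantize(spatial) + quantize(temporal)

def hamming(h1, h2):
    """Number of differing bits between two equal-length hashes."""
    return sum(a != b for a, b in zip(h1, h2))
```

A copy that only slightly perturbs the feature values tends to keep the same bits, so its Hamming distance to the original hash stays small.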
- Conference Article
28
- 10.1109/bigmm.2015.82
- Apr 1, 2015
This paper proposes a framework for recognizing human actions from depth video sequences by designing a novel feature descriptor based on Depth Motion Maps (DMMs), the Contourlet Transform (CT), and Histograms of Oriented Gradients (HOGs). First, CT is applied to the DMMs generated from a depth video sequence, and HOGs are then computed for each contourlet sub-band. Finally, the concatenation of these HOG features is used as the feature descriptor for the depth video sequence. With this new feature descriptor, the l2-regularized collaborative representation classifier is used to recognize human actions. The experimental results on the Microsoft Research Action3D dataset demonstrate that the proposed method achieves state-of-the-art performance for human activity recognition due to the precise feature extraction of the contourlet transform on the DMMs.
- Book Chapter
29
- 10.1007/978-3-319-27857-5_55
- Jan 1, 2015
This paper presents a new method for human activity recognition using depth sequences. Each depth sequence is represented by three depth motion maps (DMMs) from three projection views (front, side and top) to capture motion cues. A feature extraction method utilizing spatial and orientational auto-correlations of image local gradients is introduced to extract features from DMMs. The gradient local auto-correlations (GLAC) method employs second order statistics (i.e., auto-correlations) to capture richer information from images than the histogram-based methods (e.g., histogram of oriented gradients) which use first order statistics (i.e., histograms). Based on the extreme learning machine, a fusion framework that incorporates feature-level fusion into decision-level fusion is proposed to effectively combine the GLAC features from DMMs. Experiments on the MSRAction3D and MSRGesture3D datasets demonstrate the effectiveness of the proposed activity recognition algorithm.
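A depth motion map for a single projection view can be sketched as the accumulated absolute difference between consecutive projected frames. The version below is a minimal illustration for one view only, with no thresholding of small differences; both choices, along with the function name, are assumptions, and the side and top projections are omitted.

```python
def depth_motion_map(frames):
    """Sum of absolute differences between consecutive 2D depth frames.

    frames: list of equally sized 2D lists (one projection view).
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    dmm = [[0] * cols for _ in range(rows)]
    for prev, curr in zip(frames, frames[1:]):
        for r in range(rows):
            for c in range(cols):
                dmm[r][c] += abs(curr[r][c] - prev[r][c])
    return dmm
```

Pixels that never change stay at zero, so the map highlights where motion accumulated over the whole sequence.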
- Conference Article
2
- 10.1109/hsi.2015.7170692
- Jun 1, 2015
Nowadays, more and more activity recognition algorithms improve recognition performance by combining RGB and depth information. Although the space-time volumes (STV) algorithm and the space-time local features algorithm can combine RGB and depth information effectively, they have their own defects: they are computationally expensive and are not suitable for modeling nonperiodic activity. In this paper, we propose a novel algorithm for three-dimensional human activity recognition that combines spatial-domain local texture features and spatio-temporal local texture features. On the one hand, to extract spatial local texture features, we mix the RGB and depth image sequences after applying ViBe (Visual Background extractor) and a binarization operator. We then obtain the RGB-MOHBBI and depth-MOHBBI respectively and perform an intersection operation on them. Afterwards, we extract the LBP feature from the mixed MOHBBI to describe the spatial-domain feature. On the other hand, we follow the same background subtraction and binarization method to process the RGB and depth image sequences and obtain the spatio-temporal local texture features. We then project the three-dimensional image volume onto the X-T and Y-T planes to get the spatio-temporal behavior volume change image, to which we apply the LBP operator to extract features that represent human activity in the spatio-temporal domain. Finally, we combine the two local features extracted by the LBP algorithm into one integrated feature as the final output of our model. Extensive experiments are conducted on the BUPT Arm Activity Dataset and the BUPT Arm And Finger Activity Dataset. The experimental results demonstrate that the proposed algorithm can effectively make up for the deficiencies of traditional activity recognition algorithms and performs well on databases of varying complexity.
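The LBP operator used in both branches above can be illustrated with the basic 3×3 variant: each neighbour is compared with the centre pixel and the eight comparison bits form a code, whose histogram serves as the texture feature. The neighbour ordering, the `>=` comparison, and the function names are conventions assumed here; the abstract does not state which LBP variant the authors use.

```python
def lbp_code(img, r, c):
    """8-bit LBP code of the interior pixel at (r, c)."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

The same function can be applied to a spatial image or to an X-T or Y-T slice, since each is just a 2D array.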
- Conference Article
3
- 10.1109/icacci.2018.8554583
- Sep 1, 2018
Hand detection is a vital step towards developing a gesture recognition system. Robust hand detection is a challenging task and needs a deeper investigation of hand-oriented features under practical conditions. Existing texture features such as the histogram of oriented gradients (HOG) and the Gabor feature are efficient but require high extraction time due to their dense nature. If the feature is extracted from an edge-filtered image, only the vital edge features are processed, which reduces computation and time complexity. Therefore, the present work proposes a bit-plane based feature extraction approach. Also, a new texture feature is proposed, Gradient Local Auto-Correlations (GLAC), which extracts second-order statistical parameters such as curvature statistics, unlike the HOG, Gabor, and histogram features. GLAC is also modified into the GLACgrid feature to extract local texture features by using spatial binning grids of $2\times 3$ with 5 orientation bins. Experimental observations showed that the performance of the GLACgrid feature is approximately 3.5%, 10.6%, and 19% higher than the HOG, Gabor, and histogram features, respectively. Evaluation models are developed using the Naive Bayes classifier, Real AdaBoost, Gentle AdaBoost, Modest AdaBoost, and the support vector machine (SVM). The response time of bit-plane GLAC features is considerably lower than that of the HOG and Gabor features, which makes them an efficient candidate for real-time hand detection systems.
- Conference Article
35
- 10.1109/jurse.2017.7924590
- Mar 1, 2017
In this study we seek to map urban poverty in Colombo, Sri Lanka using spectral and spatial features estimated from high spatial resolution satellite imagery. For this study we calculated 165 spectral and spatial features at a block size of 16m and a range of scales, from three Quickbird scenes, collected in 2010 which cover 316 Grama Niladhari (GN) census units within the District of Colombo and includes the urban area of Colombo, Sri Lanka. The features calculated include linear support regions (LSR), linear binary pattern moments (LBPM), PanTex, Histogram of Oriented Gradients (HoG), Speeded Up Robust Features (SURF), Fourier Transform (FT), Gabor, the mean of each of the blue, green, red, and near-infrared spectral bands, as well as the Normalized Difference Vegetation Index (NDVI). For each GN census unit (avg. size of 2.17 sq. km), the zonal sum, mean, and standard deviation of all 165 features were calculated. For each GN unit, the 10th/20th/30th/40th percentiles of the national distribution of household estimates of predicted per capita consumption were calculated using data from the 2011 Sri Lankan Census to provide an estimate of poverty. Results indicate that the combined spatial and spectral features were able to explain up to 54% of the variation in poverty when using a simple, ordinary least squares linear regression model.
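Two of the computations named above are simple enough to sketch: NDVI from the red and near-infrared bands, and the zonal sum, mean, and standard deviation of a feature over one census unit. The function names are hypothetical, returning 0.0 for a zero denominator is an assumption, and the abstract does not say whether the population or sample standard deviation was used (the population form is shown).

```python
from statistics import mean, pstdev

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0

def zonal_stats(values):
    """Sum, mean, and population standard deviation over one zone."""
    return {"sum": sum(values), "mean": mean(values), "std": pstdev(values)}
```

Applying `zonal_stats` to each feature within each census unit would yield per-unit predictors of the kind that a regression model could then use.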
- Research Article
8
- 10.11591/ijeecs.v25.i2.pp892-899
- Feb 1, 2022
- Indonesian Journal of Electrical Engineering and Computer Science
Humans can perform an enormous number of actions like running, walking, pushing, and punching, and can perform them in multiple ways. Hence recognizing a human action from a video is a challenging task. In a supervised learning environment, actions are first represented using robust features and then a classifier is trained for classification. The selection of a classifier does affect the performance of human action recognition. This work focuses on the comparison of two neural network structures, namely the feed forward neural network and the cascade forward neural network, for human action recognition. Histogram of oriented gradients (HOG) and histogram of optical flow (HOF) are used as features for representing the actions. HOG represents the spatial features of the video while HOF gives the motion features of the video. The performance of the two neural network architectures is compared based on recognition accuracy. Well-known publicly available datasets for action and interaction detection are used for testing. It is seen that, for human action recognition applications, the feed forward neural network gives better results in terms of recognition accuracy than the cascade forward neural network.
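As a rough illustration of the HOF feature mentioned above, the sketch below bins optical-flow vectors by direction and weights each vote by the flow magnitude. It assumes the flow field has already been computed elsewhere; the bin count, the magnitude weighting, and the function name are assumptions for the example and are not details from the paper.

```python
from math import atan2, hypot, pi

def hof(flow, bins=8):
    """Magnitude-weighted, normalised histogram of flow directions.

    flow: iterable of (dx, dy) displacement vectors.
    """
    hist = [0.0] * bins
    for dx, dy in flow:
        mag = hypot(dx, dy)
        if mag == 0:
            continue  # a zero vector has no direction
        angle = atan2(dy, dx) % (2 * pi)  # map the angle into [0, 2*pi)
        hist[int(angle / (2 * pi) * bins) % bins] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

HOG follows the same binning pattern but operates on image gradients instead of flow vectors, which is why the two are often described as spatial and motion counterparts.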
- Conference Article
2
- 10.1109/cac53003.2021.9727556
- Oct 22, 2021
Accurate segmentation of manual assembly action in an uncertain assembly scene is the premise and important foundation of robot autonomous learning to obtain action sequence. Therefore, this paper proposes a method of hand assembly action segmentation based on spatiotemporal features. This method takes the RGB-D video in the process of manual assembly demonstration as the research object. Firstly, the assembly scene graph of each video frame is constructed. On this basis, the spatial features of each video frame are extracted by using the graph network model. Then, the multi-stage temporal convolution network is used to process the spatial features in the time dimension to obtain the spatial and temporal features of each video frame. The spatiotemporal features pass through the softmax layer to obtain the recognition results of each frame of video, and the adjacent frames with the same action type are combined to obtain the manual assembly action sequence. This method obtains the temporal relationship between the front and back actions and avoids the problems of unaligned action boundaries and unsmooth action fragments in the hand assembly action segmentation method based on a single spatial feature. The experimental results show that the action editing score is improved from 78.18% to 99.28%, which verifies the effectiveness of the method.
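The last step described above, combining adjacent frames that share an action type into an action sequence, can be sketched with a run-length grouping of the per-frame predictions. The function name and the output format (label, first frame, last frame) are assumptions for illustration.

```python
from itertools import groupby

def frames_to_segments(labels):
    """Collapse per-frame labels into (label, start, end) runs.

    Frame indices are 0-based and `end` is inclusive.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length - 1))
        start += length
    return segments
```

Noisy per-frame predictions would produce many short runs, which is the fragmentation that a temporal model is meant to reduce before this merging step.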
- Research Article
4
- 10.3390/electronics11152283
- Jul 22, 2022
- Electronics
The trajectory data of aircraft, ships, and so on, can be analyzed to obtain valuable information. Clustering is the basic technology of trajectory analysis, and the feature extraction process is one of the decisive factors for clustering performance. Trajectory features can be divided into two categories: spatial features and temporal features. In mainstream algorithms, spatial features are represented by latitude and longitude coordinates. However, such algorithms are only suitable for trajectories where spatial features are tightly coupled with latitude and longitude. When the same types of trajectories are in different latitude and longitude ranges or there are transformations such as rotation, scaling, and so on, this kind of algorithm is infeasible. Therefore, this paper proposes a spatio-temporal feature trajectory clustering algorithm based on deep learning. In this algorithm, the extraction process of the trajectory spatial shape feature is designed based on image matching technology, and the extracted spatial features are combined with the trajectory temporal features to improve the clustering performance. The experimental results on simulated and real datasets show that the algorithm can effectively extract the trajectory spatial shape features and that the clustering effect of the fused spatio-temporal feature is better than that of a single feature.
- Research Article
1
- 10.14738/aivp.65.5340
- Oct 31, 2018
- Advances in Image and Video Processing
Research on human action recognition from depth video sequences is increasing day by day due to its vast application in automatic surveillance systems, entertainment environments, healthcare systems, etc. In our project, we improve human action recognition accuracy using shape features. We use Histogram of Oriented Gradients (HOG) and Pyramid Histogram of Oriented Gradients (PHOG) to extract shape features. The feature extraction algorithms are used to extract shape features from a dataset of different action videos. At first, depth motion maps (DMMs) are constructed from every action video. Then, the HOG and PHOG features are extracted from each DMM. Using these features, actions are recognized by the
- Conference Article
5
- 10.1109/ism.2016.0058
- Dec 1, 2016
We propose a 3D action recognition algorithm which uses depth-based Gradient Local Auto-Correlations (GLAC) features and Locality-constrained Affine Subspace Coding (LASC) to improve the discriminative ability of human actions in spatio-temporal subsequences of 3D depth videos. First, each entire depth video sequence is divided automatically into a set of subsequences (i.e., multi-scale sub-actions) by the normalized motion energy vector. Next, Depth Motion Map (DMM)-based GLAC features are employed to capture the shape information and motion cues of each sub-action. In order to obtain a more compact and discriminative representation, LASC is then proposed to encode the features extracted from the depth video. We show that the use of LASC exhibits better performance compared to existing methods such as Locality-constrained Linear Coding (LLC). On all three datasets we obtain competitive results compared to fifteen methods, while using fewer features and less complex models.
- Research Article
1
- 10.3934/era.2022210
- Jan 1, 2022
- Electronic Research Archive
Human behavior recognition has always been a hot spot for research in computer vision. In this paper, we propose a novel video behavior recognition method based on Actional-Structural Graph Convolution and a Temporal Extension Module under the framework of a Spatio-Temporal Graph Convolution Neural Network, which can optimize the spatial and temporal features simultaneously. The basic network framework of our method consists of three parts: spatial graph convolution module, temporal extension module and attention mechanism module. In the spatial dimension, the action graph convolution is utilized to obtain abundant spatial features by capturing the correlations of distant joint features, and the structural graph convolution expands the existing skeleton graph to acquire the spatial features of adjacent joints. In the time dimension, the sampling range of the temporal graph is expanded for extracting the same and adjacent joints of adjacent frames. Furthermore, attention mechanisms are introduced to improve the performance of our method. In order to verify the effectiveness and accuracy of our method, a large number of experiments were carried out on two standard behavior recognition datasets: NTU-RGB+D and Kinetics. Comparative experiment results show that our proposed method can achieve better performance.
- Research Article
2
- 10.1364/optica.543225
- Feb 25, 2025
- Optica
Automatically detecting bacteria from pathological sections is of great significance in clinical practice, providing precious information for accurate decision-making in disease diagnosis and treatment. However, traditional bacterial identification methods require professional medical equipment and operations, making them costly and time-consuming. Learning-based methods can detect bacteria through spatial features, but their accuracy is unsatisfactory due to the limited modeling capabilities of existing deep-learning models. Considering that RGB images contain both spatial and spectral target information, here we propose to investigate the latent spectral features to enhance accurate and efficient bacterial detection. Specifically, we first performed hyperspectral image (HSI) reconstruction from RGB bacterial images, which can investigate underlying spectral features without additional cumbersome and expensive spectral imaging systems. The HSI reconstruction network builds on the spatial-frequency block (SF-block) under the U-shaped architecture. The SF-block combines frequency-wise self-attention (FWSA) and spatial-wise local-window self-attention (LWSA) modules in a parallel design. Such a framework can effectively model the spatial sparsity of bacteria and the interspectral similarity of HSI. It also enables the complementary fusion of spatial and spectral features, establishes cross-window connections, and expands the receptive field while maintaining linear complexity. Then, by stacking SF-blocks at multiple scales, we can effectively detect bacteria from the reconstructed HSIs by integrating both spectral and spatial features, and output each bacterium’s location, size, and category. We constructed a large-scale bacterial detection data set for network training and testing that contains 2910 labeled images over four common bacterial categories. 
Extensive experiments show that our method achieved state-of-the-art bacterial detection accuracy of 92.4% at a speed of 11 FPS, which is 3 orders of magnitude faster than traditional methods.