Micro-Expression Recognition Based On 3DCNN Combined With GRU and New Attention Mechanism
Micro-expressions, as a form of non-verbal emotional expression, play a key role in interpersonal interaction. However, they are also quite challenging to analyze. In this paper, we propose a dual-branch shallow 3DCNN architecture that combines a Gated Recurrent Unit (GRU) with an enhanced Channel Attention Module (CAM) from the Convolutional Block Attention Module (CBAM), making it more suitable for recognizing facial micro-expressions. Experiments show that the proposed method achieves good results with a relatively simple architecture. The source code and pre-trained models are available at https://github.com/dannyFan-0201/ICIP_2024.
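As a rough illustration of the recurrent half of such a 3DCNN-plus-GRU pipeline (a sketch, not the authors' implementation), a single GRU step can be written in a few lines of NumPy; the weight names, dimensions, and the idea of feeding pooled per-frame 3DCNN features are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: x is the current (e.g., pooled 3DCNN) frame feature, h the previous state."""
    z = sigmoid(Wz @ x + Uz @ h)                 # update gate
    r = sigmoid(Wr @ x + Ur @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))     # candidate state
    return (1.0 - z) * h + z * h_tilde           # interpolated new hidden state

# Illustrative shapes: 8-dim frame features, 4-dim hidden state, 5 "frames".
rng = np.random.default_rng(0)
d_in, d_h = 8, 4
params = [rng.standard_normal(s) * 0.1
          for s in [(d_h, d_in), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for t in range(5):
    h = gru_cell(rng.standard_normal(d_in), h, *params)
print(h.shape)
```

Because each new state is a convex combination of the previous state and a tanh candidate, the hidden state stays bounded in (-1, 1), which is what makes the recurrence stable over a clip.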
- Research Article
- 10.1155/2021/7799100
- Aug 17, 2021
- Journal of Healthcare Engineering
Microexpressions can manifest the real mood of humans and have attracted wide attention in clinical diagnosis and depression analysis. To address the problem of missing discriminative spatiotemporal features in small data sets, caused by the short duration and subtle movement changes of microexpressions, we present a dual-stream spatiotemporal attention network (DSTAN) that integrates a dual-stream spatiotemporal network and an attention mechanism to capture the deformation and spatiotemporal features of microexpressions from small samples. The spatiotemporal networks in DSTAN are based on two lightweight networks: the spatiotemporal appearance network (STAN), which learns appearance features from microexpression sequences, and the spatiotemporal motion network (STMN), which learns motion features from optical flow sequences. To focus on the discriminative motion areas of microexpressions, we construct a novel attention mechanism for the spatial models of STAN and STMN, including a multiscale-kernel spatial attention mechanism and a global dual-pool channel attention mechanism. To obtain the importance of each frame in the microexpression sequence, we design a temporal attention mechanism for the temporal models of STAN and STMN, forming spatiotemporal appearance network-attention (STAN-A) and spatiotemporal motion network-attention (STMN-A), which can adaptively perform dynamic feature refinement. Finally, a feature-concatenation-SVM method is used to integrate STAN-A and STMN-A into a novel network, DSTAN. Extensive experiments on three small spontaneous microexpression data sets, SMIC, CASME, and CASME II, demonstrate that the proposed DSTAN can effectively cope with the recognition of microexpressions.
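A minimal sketch of the "global dual-pool channel attention" idea described above (global average pooling and global max pooling feeding a shared MLP, summed and squashed by a sigmoid): the function and weight names are my assumptions for illustration, not the paper's code:

```python
import numpy as np

def dual_pool_channel_attention(feat, W1, W2):
    """feat: (C, H, W) feature map; W1, W2: weights of a shared two-layer MLP."""
    avg = feat.mean(axis=(1, 2))                       # global average pool -> (C,)
    mx = feat.max(axis=(1, 2))                         # global max pool -> (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)       # shared bottleneck MLP with ReLU
    w = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))    # sigmoid of the summed branches
    return feat * w[:, None, None], w                  # channel-reweighted map, weights

rng = np.random.default_rng(1)
C, H, W, r = 16, 7, 7, 4                               # r: channel reduction ratio
feat = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
out, w = dual_pool_channel_attention(feat, W1, W2)
print(out.shape, w.shape)
```

Using both pools lets the attention see the average response of a channel as well as its strongest activation, which matters for low-intensity motion.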
- Research Article
- 10.3390/e25030460
- Mar 6, 2023
- Entropy
Micro-expression recognition (MER) is challenging due to the difficulty of capturing the instantaneous and subtle motion changes of micro-expressions (MEs). Early works based on hand-crafted features extracted from prior knowledge showed some promising results, but have recently been replaced by deep learning methods based on the attention mechanism. However, with limited ME sample sizes, features extracted by these methods lack discriminative ME representations, resulting in MER performance that has yet to improve. This paper proposes the Dual-branch Attention Network (Dual-ATME) for MER to address the problem of ineffective single-scale features representing MEs. Specifically, Dual-ATME consists of two components: Hand-crafted Attention Region Selection (HARS) and Automated Attention Region Selection (AARS). HARS uses prior knowledge to manually extract features from regions of interest (ROIs). Meanwhile, AARS is based on attention mechanisms and extracts hidden information from data automatically. Finally, through similarity comparison and feature fusion, the dual-scale features can be used to learn ME representations effectively. Experiments on spontaneous ME datasets (including CASME II, SAMM, and SMIC) and their composite dataset, MEGC2019-CD, showed that Dual-ATME achieves better, or more competitive, performance than state-of-the-art MER methods.
- Conference Article
- 10.1117/12.2671575
- Apr 29, 2023
Micro-expressions are rapid facial expressions, difficult to observe with the naked eye, that can reflect real human inner emotions. Micro-expression recognition is still a great challenge due to their very short duration and subtle changes (small amplitude of muscle contraction or relaxation). To address this, this paper proposes a 3D convolutional micro-expression recognition method based on an attention mechanism, with a dual-stream structure that can effectively utilize the features of both the image sequence and the optical flow sequence. More effective micro-expression features are extracted using an Attention layer and a Co-Attention layer to better solve the micro-expression recognition task. Extensive experiments are conducted on the dataset to verify that the model achieves better recognition results.
- Research Article
- 10.1016/j.vrih.2022.03.006
- Apr 1, 2023
- Virtual Reality & Intelligent Hardware
Adaptive spatio-temporal attention neural network for cross-database micro-expression recognition
- Research Article
- 10.32604/cmc.2022.028801
- Jan 1, 2022
- Computers, Materials & Continua
Micro-expressions are manifested through subtle and brief facial movements that reveal a person's genuine hidden emotion. In a video sequence, the frame that captures the maximum facial difference is called the apex frame. Apex frame spotting is therefore a crucial sub-module in a micro-expression recognition system. However, this spotting task is very challenging because micro-expressions occur over a short duration with low-intensity muscle movements. Moreover, most existing automated works have difficulty differentiating micro-expressions from other facial movements. Therefore, this paper presents a deep learning model with an attention mechanism to spot the micro-expression apex frame from optical flow images. The attention mechanism is embedded into the model so that more weight can be allocated to the regions that manifest facial movements with higher intensity. The proposed method has been tested and verified on two spontaneous micro-expression databases, namely the Spontaneous Micro-facial Movement (SAMM) and Chinese Academy of Sciences Micro-expression (CASME) II databases. System performance is evaluated using the Mean Absolute Error (MAE) metric, which measures the distance between the predicted apex frame and the ground-truth label. The best MAE of 14.90 was obtained when a combination of five convolutional layers, local response normalization, and the attention mechanism was used to model apex frame spotting. Even with limited datasets, the results prove that the attention mechanism better emphasizes the regions where facial movements are likely to occur and hence improves spotting performance.
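The MAE metric reported above is simply the average distance, in frame indices, between predicted and ground-truth apex frames. A tiny sketch of that computation (function and variable names are mine, the example numbers are invented):

```python
def apex_mae(pred_frames, gt_frames):
    """Mean absolute error between predicted and ground-truth apex frame indices."""
    assert len(pred_frames) == len(gt_frames)
    return sum(abs(p - g) for p, g in zip(pred_frames, gt_frames)) / len(pred_frames)

# Four illustrative clips with per-clip errors 2, 2, 0, 4 frames -> MAE 2.0
result = apex_mae([10, 20, 33, 7], [12, 18, 33, 11])
print(result)  # 2.0
```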
- Research Article
- 10.3389/fnins.2023.1216181
- Jul 27, 2023
- Frontiers in Neuroscience
Micro-expressions are facial muscle movements that hide genuine emotions. In response to the challenge of the low intensity of micro-expressions, recent studies have attempted to locate localized areas of facial muscle movement. However, this ignores the feature redundancy caused by inaccurate locating of the regions of interest. This paper proposes a novel multi-scale fusion visual attention network (MFVAN), which learns multi-scale local attention weights to mask regions of redundant features. Specifically, the model extracts multi-scale features of the apex frame in micro-expression video clips with convolutional neural networks. The attention mechanism focuses on the weights of local region features in the multi-scale feature maps. We then mask redundant regions in the multi-scale features and fuse local features with high attention weights for micro-expression recognition. Self-supervision and transfer learning reduce the influence of individual identity attributes and increase the robustness of the multi-scale feature maps. Finally, the multi-scale classification loss, the mask loss, and a loss for removing individual identity attributes are combined to optimize the model. The proposed MFVAN method is evaluated on the SMIC, CASME II, SAMM, and 3DB-Combined datasets and achieves state-of-the-art performance. The experimental results show that focusing on local regions at multiple scales contributes to micro-expression recognition. The proposed MFVAN model is the first to combine image generation with visual attention mechanisms to address the combined challenge of individual identity attribute interference and low-intensity facial muscle movements. Meanwhile, the MFVAN model reveals the impact of individual attributes on the localization of local ROIs.
- Research Article
- 10.1016/j.patcog.2021.108275
- Aug 25, 2021
- Pattern Recognition
Feature refinement: An expression-specific feature learning and fusion method for micro-expression recognition
- Research Article
- 10.1007/s00530-022-00934-6
- Jun 3, 2022
- Multimedia Systems
Micro-expression recognition with attention mechanism and region enhancement
- Research Article
- 10.3390/s23125650
- Jun 16, 2023
- Sensors (Basel, Switzerland)
In the billions of faces shaped by thousands of different cultures and ethnicities, one thing remains universal: the way emotions are expressed. To take the next step in human-machine interaction, a machine (e.g., a humanoid robot) must be able to classify facial emotions. Allowing systems to recognize micro-expressions affords the machine a deeper dive into a person's true feelings, so that human emotion can be taken into account while making optimal decisions. For instance, such machines will be able to detect dangerous situations, alert caregivers to challenges, and provide appropriate responses. Micro-expressions are involuntary and transient facial expressions capable of revealing genuine emotions. We propose a new hybrid neural network (NN) model capable of micro-expression recognition in real-time applications. Several NN models are first compared in this study. Then, a hybrid NN model is created by combining a convolutional neural network (CNN), a recurrent neural network (RNN, e.g., long short-term memory (LSTM)), and a vision transformer. The CNN can extract spatial features (within a neighborhood of an image), whereas the LSTM can summarize temporal features. In addition, a transformer with an attention mechanism can capture sparse spatial relations residing in an image or between frames in a video clip. The inputs of the model are short facial videos, while the outputs are the micro-expressions recognized from the videos. The NN models are trained and tested with publicly available facial micro-expression datasets to recognize different micro-expressions (e.g., happiness, fear, anger, surprise, disgust, sadness). Score fusion and improvement metrics are also presented in our experiments. The results of our proposed models are compared with those of literature-reported methods tested on the same datasets. The proposed hybrid model performs the best, and score fusion can dramatically increase recognition performance.
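Score fusion, as mentioned above, is commonly implemented as a (possibly weighted) average of the per-class probability vectors produced by each model, with the fused class taken as the argmax. A minimal sketch under that assumption (the function name and the toy score matrices are mine):

```python
import numpy as np

def fuse_scores(score_mats, weights=None):
    """score_mats: list of (n_samples, n_classes) probability arrays, one per model."""
    stacked = np.stack(score_mats)                    # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.full(len(score_mats), 1.0 / len(score_mats))
    fused = np.tensordot(weights, stacked, axes=1)    # weighted average over models
    return fused, fused.argmax(axis=1)                # fused scores, predicted classes

# Two toy models (e.g., a CNN-LSTM and a transformer) on two samples, three classes.
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.4, 0.5, 0.1], [0.1, 0.2, 0.7]])
fused, preds = fuse_scores([model_a, model_b])
print(preds)  # [0 2]
```

Here the second model overturns the first model's decision on sample 1 because its confidence is higher, which is how fusion can beat either model alone.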
- Research Article
- 10.3390/electronics13204012
- Oct 12, 2024
- Electronics
Microexpressions are subtle facial movements that occur within an extremely brief time frame, often revealing suppressed emotions. These expressions hold significant importance across various fields, including security monitoring and human–computer interaction. However, the accuracy of microexpression recognition is severely constrained by the inherent characteristics of these expressions. To address the issue of low detection accuracy regarding the subtle features present in microexpressions’ facial action units, this paper proposes a microexpression action unit detection algorithm, Attention-embedded Dual Path and Shallow Three-stream Networks (ADP-DSTN), that incorporates an attention-embedded dual path and a shallow three-stream network. First, an attention mechanism was embedded after each Bottleneck layer in the foundational Dual Path Networks to extract static features representing subtle texture variations that have significant weights in the action units. Subsequently, a shallow three-stream 3D convolutional neural network was employed to extract optical flow features that were particularly sensitive to temporal and discriminative characteristics specific to microexpression action units. Finally, the acquired static facial feature vectors and optical flow feature vectors were concatenated to form a fused feature vector that encompassed more effective information for recognition. Each facial action unit was then trained individually to address the issue of weak correlations among the facial action units, thereby facilitating the classification of microexpression emotions. The experimental results demonstrated that the proposed method achieved great performance across several microexpression datasets. The unweighted average recall (UAR) values were 80.71%, 89.55%, 44.64%, 80.59%, and 88.32% for the SAMM, CASME II, CAS(ME)3, SMIC, and MEGC2019 datasets, respectively. The unweighted F1 scores (UF1) were 79.32%, 88.30%, 43.03%, 81.12%, and 88.95%, respectively. 
Furthermore, compared to the benchmark model, our proposed model achieved better performance with lower computational complexity: 1087.350 M Floating Point Operations (FLOPs) and a total of 6.356 × 10⁶ model parameters.
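The UAR and UF1 figures quoted above are class-balanced metrics: UAR averages per-class recall and UF1 averages per-class F1 scores, so frequent classes cannot dominate the result. A small sketch of both (the function name and toy labels are mine):

```python
def uar_uf1(y_true, y_pred):
    """Unweighted average recall and unweighted F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        rec = tp / (tp + fn) if tp + fn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        recalls.append(rec)
    return sum(recalls) / len(classes), sum(f1s) / len(classes)

# Toy example: class 0 recall 0.5, class 1 recall 1.0 -> UAR 0.75
uar, uf1 = uar_uf1([0, 0, 1, 1], [0, 1, 1, 1])
print(uar, uf1)
```

On imbalanced micro-expression datasets, these unweighted averages are the standard MEGC evaluation protocol precisely because plain accuracy would reward ignoring rare classes.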
- Research Article
- 10.1088/1742-6596/2504/1/012062
- May 1, 2023
- Journal of Physics: Conference Series
To address the low accuracy of existing deep learning-based micro-expression recognition models, their numerous network parameters, and the difficulty of deploying them on mobile devices, this paper proposes DCBAM-EfficientNet, a micro-expression recognition model that uses the lightweight EfficientNet as its backbone network and incorporates an attention module. The network can guarantee the accuracy of micro-expression recognition with relatively few network parameters. The attention mechanism highlights the more expressive micro-expression features, and the CBAM attention is improved into a DCBAM module: the large convolution kernel in the spatial attention module of CBAM is replaced by a dilated convolution with the same receptive field, reducing the network parameters while better preserving the spatial features of the image. Integrating the DCBAM module into the main structure of EfficientNet enables better integration of contextual information. Data augmentation is used to process the micro-expression dataset to reduce overfitting and improve the generalization ability of the model. The results demonstrate that the optimized DCBAM-EfficientNet model can effectively improve micro-expression recognition accuracy, significantly reduce the number and volume of model parameters, and provide a reference for deploying micro-expression recognition models on mobile devices.
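The dilated-convolution substitution described above rests on the fact that a k×k kernel with dilation d covers a receptive field of d·(k−1)+1 pixels per axis: for example, a 3×3 kernel with dilation 3 matches the 7×7 receptive field commonly used in CBAM's spatial attention while needing 9 instead of 49 weights. A quick sketch of that arithmetic (helper names are mine):

```python
def dilated_receptive_field(kernel, dilation):
    """Receptive field (pixels along one axis) of a single dilated convolution."""
    return dilation * (kernel - 1) + 1

def kernel_weight_count(kernel, in_ch=1, out_ch=1):
    """Weight count of a 2D convolution kernel (bias ignored)."""
    return kernel * kernel * in_ch * out_ch

# A dilated 3x3 (d=3) matches a plain 7x7 receptive field with ~5x fewer weights.
rf_dilated = dilated_receptive_field(3, 3)   # 7
rf_plain = dilated_receptive_field(7, 1)     # 7
print(rf_dilated, rf_plain, kernel_weight_count(3), kernel_weight_count(7))
```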
- Research Article
- 10.1007/s11760-024-03221-1
- May 8, 2024
- Signal, Image and Video Processing
Micro-expression recognition using a multi-scale feature extraction network with attention mechanisms
- Research Article
- 10.3390/computation13120277
- Dec 1, 2025
- Computation
Effective communication between deaf–mute and visually impaired individuals remains a challenge in the fields of human–computer interaction and accessibility technology. Current solutions mostly rely on single-modal recognition, which often leads to issues such as semantic ambiguity and loss of emotional information. To address these challenges, this study proposes a lightweight multimodal fusion framework that combines gestures and micro-expressions, which are then processed through a recognition network and a speech synthesis module. The core innovations of this research are as follows: (1) a lightweight YOLOv5s improvement structure that integrates residual modules and efficient downsampling modules, which reduces the model complexity and computational overhead while maintaining high accuracy; (2) a multimodal fusion method based on an attention mechanism, which adaptively and efficiently integrates complementary information from gestures and micro-expressions, significantly improving the semantic richness and accuracy of joint recognition; (3) an end-to-end real-time system that outputs the visual recognition results through a high-quality text-to-speech module, completing the closed-loop from “visual signal” to “speech feedback”. We conducted evaluations on the publicly available hand gesture dataset HaGRID and a curated micro-expression image dataset. The results show that, for the joint gesture and micro-expression tasks, our proposed multimodal recognition system achieves a multimodal joint recognition accuracy of 95.3%, representing a 4.5% improvement over the baseline model. The system was evaluated in a locally deployed environment, achieving a real-time processing speed of 22 FPS, with a speech output latency below 0.8 s. The mean opinion score (MOS) reached 4.5, demonstrating the effectiveness of the proposed approach in breaking communication barriers between the hearing-impaired and visually impaired populations.
- Research Article
- 10.1109/taffc.2022.3197785
- Oct 1, 2022
- IEEE Transactions on Affective Computing
Micro-expression recognition (MER) has attracted the attention of many researchers in the past decade. However, occlusion occurs for MER in real-world scenarios. In this paper, a challenging issue in MER that is interesting but unexplored, i.e., occluded MER, is deeply investigated. First, to research MER under real-world occlusion conditions, synthetic occluded micro-expression databases are created using various community masks. Second, to suppress the influence of occlusion, a Region-inspired Relation Reasoning Network (RRRN) is proposed to model the relations between various facial regions. The RRRN consists of a backbone network, a region-inspired (RI) module, and a relation reasoning (RR) module. More specifically, the backbone network extracts feature representations from different facial regions; the RI module computes an adaptive weight for each facial region, based on its unobstructedness and importance, to suppress the influence of occlusion using an attention mechanism; and the RR module exploits the progressive interactions among these regions by performing graph convolutions. Experiments are conducted on two tasks of MEGC 2018: the holdout-database evaluation task and the composite database evaluation task.
Experimental results show that RRRN effectively exploits the importance of facial regions and captures their cooperative, complementary relationships for MER. The results also demonstrate that RRRN outperforms state-of-the-art approaches and is especially robust under occlusion.
- Book Chapter
- 10.1007/978-3-030-89188-6_20
- Jan 1, 2021
Micro-expression recognition is a video sentiment classification task with an extremely small sample size. The transience and spatial locality of micro-expressions make it difficult both to construct large micro-expression databases and to design micro-expression recognition algorithms. To balance classification accuracy and model complexity in this domain, we propose a lightweight neural micro-expression recognizer, Off-TANet, which is based on apex-onset optical flow features. The neural network contains a simple yet powerful triplet attention mechanism, whose effectiveness can be interpreted from two aspects: FACS action units and matrix sparseness. The model is evaluated with a LOSO cross-validation strategy on a combined database comprising three mainstream micro-expression databases. With notably fewer total parameters (59,403), the experimental results indicate that the model achieves an average recall of 0.7315 and an average F1-score of 0.7242, exceeding other major architectures in this domain. A series of ablation experiments is also conducted to confirm the validity of our model design.