Among facial expressions, micro-expressions reveal genuine emotion more reliably than macro-expressions and carry more valuable information, making them useful in psychological counseling and clinical diagnosis. In recent years, deep learning methods based on optical flow and the Transformer have achieved excellent results in this field, but most current algorithms concentrate on building serialized tokens through self-attention and do not account for the spatial relationships between facial landmarks. To address the locality and subtle changes of micro-expressions, we propose MCCA-VNET, a Transformer-based deep learning model. We extract the changing features as the model input and fuse channel attention and spatial attention into the Vision Transformer to capture correlations between features across different dimensions, which improves micro-expression recognition accuracy. To verify the effectiveness of the proposed method, we conduct experiments on the SAMM, CASME II, and SMIC datasets and compare the results with previous state-of-the-art algorithms. Our model raises the UF1 and UAR scores to 0.8676 and 0.8622, respectively, on the composite dataset, outperforming other algorithms on multiple metrics and achieving the best overall performance.
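The abstract describes fusing channel attention and spatial attention into a Vision Transformer but does not give the exact architecture. The following is a minimal sketch of that general idea in PyTorch; the module names, dimensions, attention designs, and where the attention is applied within the block are illustrative assumptions, not the authors' MCCA-VNET implementation.

```python
# Minimal sketch (PyTorch): channel and spatial attention fused into a
# ViT-style encoder block. All design choices below are assumptions made
# for illustration, not the MCCA-VNET architecture from the paper.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style re-weighting of the channel dimension."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, x):                       # x: (B, N, C) patch tokens
        w = self.mlp(x.mean(dim=1))             # pool over tokens -> (B, C)
        return x * torch.sigmoid(w).unsqueeze(1)


class SpatialAttention(nn.Module):
    """Per-token (spatial) weight computed from each token's features."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (B, N, C)
        w = torch.sigmoid(self.proj(x))         # (B, N, 1) weight per token
        return x * w


class AttentionFusedBlock(nn.Module):
    """Standard ViT encoder block with channel/spatial attention on its output."""
    def __init__(self, dim: int = 192, heads: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.channel_att = ChannelAttention(dim)
        self.spatial_att = SpatialAttention(dim)

    def forward(self, x):                       # x: (B, N, C) patch tokens
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Fuse channel and spatial attention to re-weight token features.
        return self.spatial_att(self.channel_att(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 192)            # e.g. optical-flow patch embeddings
    print(AttentionFusedBlock()(tokens).shape)   # torch.Size([2, 196, 192])
```

In this sketch the channel branch re-weights feature dimensions from token-pooled statistics while the spatial branch re-weights individual patch tokens, so the two attentions capture correlations along different axes of the token grid, which is the role the abstract attributes to the fused attention in MCCA-VNET.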