Abstract
Emotion recognition has drawn consistent attention from researchers in recent years. Although the gesture modality plays an important role in expressing emotion, it is seldom considered in emotion recognition, a key reason being the scarcity of labeled data containing 3D skeleton coordinates. Some studies in action recognition have applied graph-based neural networks to explicitly model the spatial connections between joints, but this approach has not yet been explored for gesture-based emotion recognition. In this work, we apply a pose-estimation-based method to extract 3D skeleton coordinates for the IEMOCAP database. We propose a self-attention enhanced spatial temporal graph convolutional network for skeleton-based emotion recognition, in which the spatial convolutional part models the skeletal structure of the body as a static graph, while the self-attention part dynamically constructs additional connections between joints and provides supplementary information. Our experiments demonstrate that the proposed model significantly outperforms comparable models and that features extracted from the skeleton data improve the performance of multimodal emotion recognition.
Highlights
Multimodal emotion recognition has attracted a lot of attention due to its wide range of application scenarios
We construct a skeleton enhanced emotion recognition network (SERN), which integrates text and audio information with the features extracted by the self-attention enhanced spatial temporal graph convolutional network (See Figure 6)
Considering the class imbalance of the samples, the unweighted average recall (UAR) was used to evaluate the model, treating each category equally
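Unweighted average recall averages the per-class recalls so that a majority class cannot dominate the score. A minimal sketch of the metric (the function name and toy labels are illustrative, not from the paper):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally,
    regardless of how many samples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Imbalanced toy example: "neu" dominates, every "ang" sample is missed.
y_true = ["neu"] * 8 + ["ang"] * 2
y_pred = ["neu"] * 10
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

Plain accuracy on this toy split would be 0.8, while UAR is 0.5, which is why UAR is the fairer choice for imbalanced emotion categories.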
Summary
Multimodal emotion recognition has attracted a lot of attention due to its wide range of application scenarios. Research using speech signals, textual transcriptions and facial expressions mostly evaluates models on large open-source multimodal emotional benchmark datasets, such as the interactive emotional dyadic motion capture database (IEMOCAP, over 10,000 samples) [6]. These databases do not contain skeleton data representing the gesture modality, which makes them difficult to use for gesture-based emotion recognition. A strong connection between joints is likely necessary, but a fixed graph structure does not guarantee that the network can capture the appropriate dependencies. To solve these problems, we make the following contributions: (i) we extract 3D skeleton movement data from raw video based on pose estimation, a method that can be used to expand existing databases and alleviate the lack of labeled data. The resulting performance significantly exceeds that of the bimodal model using only audio and text information, which demonstrates the effectiveness of the extracted modality
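The core idea of combining a static skeleton graph with dynamically learned joint connections can be sketched as a single spatial layer. This is a minimal illustration, not the paper's implementation: the joint count, feature size, weight names and the identity placeholder adjacency are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_enhanced_gcn_layer(X, A_static, Wq, Wk, Wout, alpha=0.5):
    """One spatial layer: features propagate over the fixed skeleton
    graph plus a data-dependent self-attention graph.
    X: (J, d) per-joint features; A_static: (J, J) normalized adjacency."""
    Q, K = X @ Wq, X @ Wk
    A_attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # dynamic joint-joint links
    A = A_static + alpha * A_attn                     # supplementary connections
    return np.maximum(A @ X @ Wout, 0.0)              # ReLU activation

J, d = 5, 8  # toy skeleton: 5 joints with 8-dim features (illustrative sizes)
X = rng.standard_normal((J, d))
A_static = np.eye(J)  # placeholder; a real adjacency encodes the bone links
Wq, Wk, Wout = (rng.standard_normal((d, d)) for _ in range(3))
out = attention_enhanced_gcn_layer(X, A_static, Wq, Wk, Wout)
print(out.shape)  # (5, 8)
```

The attention term lets distant joints (e.g. two hands) exchange information even when no bone path connects them directly, which is exactly the dependency the fixed graph alone cannot guarantee.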