Abstract
Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. A system that can accurately and stably recognize emotions from human speech can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for reliable recognition, and they do not effectively model the inter-modality dynamics of speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for the text and audio modalities, and an autoencoder-based feature fusion module. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features that capture time-frequency relationships in the original speech signal. Because text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) with an attention mechanism to model its intra-modal dynamics. The high-level text and audio features are then fed to the autoencoder in parallel to learn a shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) corpora outperform existing methods, and the results on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) corpus are competitive. Furthermore, the experiments show that joint multimodal information (audio and text) improves overall performance compared to unimodal inputs, and that autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
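To make the described architecture concrete, the following is a minimal PyTorch sketch of the three-part pipeline: a BLSTM-with-attention text branch, an audio branch, and an autoencoder that fuses the two high-level features into a shared representation for classification. The TGFE is the paper's own contribution and is not specified in the abstract, so a generic BLSTM temporal encoder stands in for it here; all layer sizes, class counts, and names are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the multimodal fusion pipeline described in the abstract.
# AudioEncoder is a generic stand-in for the paper's TGFE (not specified here);
# all dimensions and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Text branch: BLSTM with additive attention over word embeddings."""
    def __init__(self, embed_dim=300, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (B, T, embed_dim)
        h, _ = self.blstm(x)                     # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        return (w * h).sum(dim=1)                # (B, 2*hidden)


class AudioEncoder(nn.Module):
    """Audio branch: stand-in for TGFE, a BLSTM over frame-level
    time-frequency features (e.g. a log-mel spectrogram)."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)

    def forward(self, x):                        # x: (B, T, feat_dim)
        h, _ = self.blstm(x)
        return h.mean(dim=1)                     # (B, 2*hidden)


class FusionAutoencoder(nn.Module):
    """Autoencoder over the concatenated high-level features; the latent
    code is the shared representation used for emotion classification."""
    def __init__(self, in_dim=512, latent=128, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent), nn.ReLU())
        self.decoder = nn.Linear(latent, in_dim)
        self.classifier = nn.Linear(latent, n_classes)

    def forward(self, text_feat, audio_feat):
        fused = torch.cat([text_feat, audio_feat], dim=-1)  # (B, in_dim)
        z = self.encoder(fused)                  # shared representation
        recon = self.decoder(z)                  # reconstruction target
        return self.classifier(z), recon, fused


# Toy forward pass: batch of 2 utterances, 50 tokens / 200 audio frames.
text_feat = TextEncoder()(torch.randn(2, 50, 300))
audio_feat = AudioEncoder()(torch.randn(2, 200, 40))
logits, recon, fused = FusionAutoencoder()(text_feat, audio_feat)
loss = (nn.functional.cross_entropy(logits, torch.tensor([0, 1]))
        + nn.functional.mse_loss(recon, fused))  # joint objective (assumed)
```

The reconstruction term is one plausible way to keep the latent code informative about both modalities; the abstract does not state the authors' exact training objective.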