Abstract

An end-to-end model with convolutional layers and a multi-head self-attention mechanism is proposed for the Speech Emotion Recognition (SER) task. As inputs, we propose to use both deep encoded linguistic features, which carry the language-related context of emotion, and audio spectrograms, which represent acoustic cues. To obtain the deep linguistic feature representation, we use the outputs of an intermediate layer of a pre-trained Automatic Speech Recognition (ASR) model, where the layer is selected empirically. The influence of the acoustic and linguistic features, both separately and in combination, is studied for emotion recognition in different scenarios (scripted and spontaneous recordings of emotional speech samples). Extensive experiments on the standard IEMOCAP database are conducted to investigate the efficacy of the proposed approach. To address class imbalance, we apply downsampling and ensembling, which further improve SER accuracy. Overall, we observe that the acoustic features perform best on the improvised recordings, owing to the spontaneity of the speech and its weaker linguistic correlation, whereas the linguistic features prove effective for the scripted scenario as well as for the combined (scripted and improvised recordings together) scenario, which carries more linguistic information in the spoken utterances.
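To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a fusion model that combines a convolutional front-end over the spectrogram with intermediate-layer ASR embeddings, followed by multi-head self-attention. All layer sizes, the 4-class emotion set, the `asr_dim` width, and the concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionSER(nn.Module):
    """Hypothetical acoustic + linguistic fusion model for SER."""
    def __init__(self, n_mels=64, asr_dim=512, d_model=128, n_heads=4, n_classes=4):
        super().__init__()
        # Convolutional front-end over the (batch, 1, n_mels, time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.acoustic_proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Project the assumed ASR intermediate-layer embeddings to the same width.
        self.linguistic_proj = nn.Linear(asr_dim, d_model)
        # Multi-head self-attention over the concatenated acoustic+linguistic sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spectrogram, asr_features):
        # spectrogram: (batch, 1, n_mels, time); asr_features: (batch, frames, asr_dim)
        a = self.conv(spectrogram)              # (batch, 64, n_mels/4, time/4)
        a = a.permute(0, 3, 1, 2).flatten(2)    # (batch, time/4, 64 * n_mels/4)
        a = self.acoustic_proj(a)               # (batch, time/4, d_model)
        l = self.linguistic_proj(asr_features)  # (batch, frames, d_model)
        x = torch.cat([a, l], dim=1)            # fuse along the sequence axis
        x, _ = self.attn(x, x, x)               # multi-head self-attention
        return self.classifier(x.mean(dim=1))   # utterance-level emotion logits

# Example usage with random tensors standing in for real features:
model = FusionSER()
spec = torch.randn(2, 1, 64, 200)   # batch of two log-mel spectrograms
asr = torch.randn(2, 100, 512)      # assumed ASR intermediate-layer outputs
logits = model(spec, asr)           # shape (2, 4)
```

Concatenating the two feature streams along the sequence axis lets the self-attention layer attend jointly across acoustic frames and linguistic embeddings; other fusion strategies (e.g. late fusion of separate classifiers) are equally possible under this sketch's assumptions.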
