Abstract

An end-to-end model with convolutional layers and a multi-head self-attention mechanism is proposed for the Speech Emotion Recognition (SER) task. As inputs, we propose to use both deep encoded linguistic features, which carry the language-related context of emotion, and audio spectrograms, which represent acoustic cues. To obtain the deep linguistic feature representation, we use the outputs of an intermediate layer of a pre-trained Automatic Speech Recognition (ASR) model, where the layer is selected empirically. The influence of the acoustic and linguistic features, both separately and in combination, is studied for emotion recognition in different scenarios (scripted and spontaneous recordings of emotional speech samples). Extensive experiments on the standard IEMOCAP database are conducted to investigate the efficacy of the proposed approach. To address class imbalance, we apply downsampling and ensembling, which further improve SER accuracy. Overall, we observe that the acoustic features perform best on the improvised recordings, owing to the spontaneity of the speech and its weaker linguistic correlation, whereas the linguistic features prove effective for the scripted scenario as well as for the combined (scripted and improvised recordings together) scenario, which carries more linguistic information in the spoken utterances.
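To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a fusion model that combines a convolutional front-end over the spectrogram with intermediate-layer ASR embeddings, followed by multi-head self-attention. All layer sizes, the 4-class emotion set, the `asr_dim` width, and the concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionSER(nn.Module):
    """Hypothetical acoustic + linguistic fusion model for SER."""
    def __init__(self, n_mels=64, asr_dim=512, d_model=128, n_heads=4, n_classes=4):
        super().__init__()
        # Convolutional front-end over the (batch, 1, n_mels, time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.acoustic_proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Project the assumed ASR intermediate-layer embeddings to the same width.
        self.linguistic_proj = nn.Linear(asr_dim, d_model)
        # Multi-head self-attention over the concatenated acoustic+linguistic sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spectrogram, asr_features):
        # spectrogram: (batch, 1, n_mels, time); asr_features: (batch, frames, asr_dim)
        a = self.conv(spectrogram)              # (batch, 64, n_mels/4, time/4)
        a = a.permute(0, 3, 1, 2).flatten(2)    # (batch, time/4, 64 * n_mels/4)
        a = self.acoustic_proj(a)               # (batch, time/4, d_model)
        l = self.linguistic_proj(asr_features)  # (batch, frames, d_model)
        x = torch.cat([a, l], dim=1)            # fuse along the sequence axis
        x, _ = self.attn(x, x, x)               # multi-head self-attention
        return self.classifier(x.mean(dim=1))   # utterance-level emotion logits

# Example usage with random tensors standing in for real features:
model = FusionSER()
spec = torch.randn(2, 1, 64, 200)   # batch of two log-mel spectrograms
asr = torch.randn(2, 100, 512)      # assumed ASR intermediate-layer outputs
logits = model(spec, asr)           # shape (2, 4)
```

Concatenating the two feature streams along the sequence axis lets the self-attention layer attend jointly across acoustic frames and linguistic embeddings; other fusion strategies (e.g. late fusion of separate classifiers) are equally possible under this sketch's assumptions.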
