Abstract
An end-to-end model with convolutional layers and a multi-head self-attention mechanism is proposed for the Speech Emotion Recognition (SER) task. As inputs, we propose to use both deep encoded linguistic features, which carry the language-related context of emotion, and the audio spectrogram, which is representative of acoustic cues. To obtain the deep linguistic feature representation, we use outputs from the intermediate layers of a pre-trained Automatic Speech Recognition (ASR) model, where the layer is selected empirically. The influence of acoustic and linguistic features, both separately and in combination, on emotion recognition in different scenarios (scripted and spontaneous recordings of emotional speech) has been studied. Extensive experiments on the standard IEMOCAP database are conducted to investigate the efficacy of our proposed approach. To address class imbalance, we carried out downsampling and ensembling, which further improved the SER accuracy. Overall, we observe that the acoustic features perform best for improvised recordings, owing to the spontaneity of the speech and its weaker linguistic correlation, whereas the linguistic features are effective for the scripted scenario as well as for the combined (scripted and improvised recordings together) scenario, which reflects more linguistic information in the spoken utterances.
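The sketch below is a minimal illustration, not the authors' implementation, of the architecture described above: a convolutional front-end over the audio spectrogram, fusion with deep linguistic features taken from an intermediate layer of a pre-trained ASR model, multi-head self-attention over the fused sequence, and an utterance-level emotion classifier. All layer sizes, the 768-dimensional ASR feature assumption, and the four-class emotion set are illustrative assumptions not specified in the abstract.

```python
# Minimal sketch of the described SER architecture (assumed dimensions throughout).
import torch
import torch.nn as nn


class SERModel(nn.Module):
    def __init__(self, n_mels=80, asr_dim=768, d_model=256, n_heads=4, n_classes=4):
        super().__init__()
        # Convolutional front-end over the (1, n_mels, T) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        self.acoustic_proj = nn.Linear(64, d_model)
        # Projection for ASR intermediate-layer features (assumed 768-dim frames).
        self.linguistic_proj = nn.Linear(asr_dim, d_model)
        # Multi-head self-attention over the concatenated acoustic + linguistic sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spectrogram, asr_features):
        # spectrogram: (B, 1, n_mels, T); asr_features: (B, T_asr, asr_dim)
        a = self.conv(spectrogram).squeeze(2).transpose(1, 2)  # (B, T, 64)
        a = self.acoustic_proj(a)                              # (B, T, d_model)
        l = self.linguistic_proj(asr_features)                 # (B, T_asr, d_model)
        x = torch.cat([a, l], dim=1)                           # fuse along the time axis
        x, _ = self.attn(x, x, x)                              # self-attention
        return self.classifier(x.mean(dim=1))                  # utterance-level logits


# Example forward pass with dummy inputs.
model = SERModel()
logits = model(torch.randn(2, 1, 80, 300), torch.randn(2, 150, 768))
print(logits.shape)  # torch.Size([2, 4])
```

Concatenating the two feature streams along the time axis before self-attention is only one of several plausible fusion strategies; the abstract does not specify how the acoustic and linguistic inputs are combined.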