Abstract

Speech Emotion Recognition (SER) has been an active area of research aimed at making Human–Computer Interaction (HCI) smoother and more natural. However, because the emotions expressed in an utterance depend on factors such as culture and speaker, the robustness of SER systems in a multi-cultural setting remains a topic of discussion among researchers. Both the universality and the cultural specificity of emotions are debated in the literature. We therefore propose two methods, one incorporating cultural specificity and another demonstrating the universal nature of emotions across cultures. In this work, we propose a novel method to build a multi-cultural SER system by incorporating impactful factors such as speaker and language as markers of cultural distinctiveness. We develop a language model and a speaker model to obtain language and speaker embeddings, and propose a multi-modal fusion architecture to combine this information with emotional cues. Moreover, a triplet-loss-based multi-cultural SER is proposed, which aims to normalize speaker and cultural variabilities and focuses on learning emotions irrespective of culture. Experiments conducted on a collection of emotion datasets in five languages show the robustness of the proposed technique in predicting emotions in a leave-one-language-out setting. The design of the triplet-loss-based system allows a new language and speaker to be incorporated without retraining the whole system.
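To illustrate the triplet-loss idea described above, the following minimal sketch (not the authors' implementation; the encoder architecture, feature dimensions, and margin are illustrative assumptions) shows how an emotion embedding can be trained so that utterances with the same emotion label are pulled together while utterances with different emotions are pushed apart, regardless of the language or speaker from which each example is drawn.

```python
# Hedged sketch of a triplet-loss setup for culture-invariant emotion embeddings.
# The EmotionEncoder below is a placeholder model, not the paper's architecture.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Maps an utterance feature sequence to a fixed-size emotion embedding."""
    def __init__(self, feat_dim=40, hidden=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        _, h = self.rnn(x)                     # h: (1, batch, hidden)
        emb = self.proj(h.squeeze(0))          # (batch, emb_dim)
        return nn.functional.normalize(emb, dim=-1)

encoder = EmotionEncoder()
triplet_loss = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value

# Anchor and positive share the same emotion label but may come from different
# languages and speakers; the negative carries a different emotion label.
anchor   = torch.randn(8, 100, 40)   # e.g. "happy", language A
positive = torch.randn(8, 100, 40)   # "happy", language B
negative = torch.randn(8, 100, 40)   # "angry", any language

loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

Because the loss only compares embeddings within each triplet, adding utterances from a new language or speaker means sampling new triplets rather than retraining the whole system, which is the property highlighted in the abstract.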
