Abstract

Emotion recognition has been studied extensively in single modalities over the last decade. However, humans usually express their emotions through multiple modalities such as voice, facial expressions, and text. This paper proposes a new method to learn a joint emotion representation for multimodal emotion recognition. Emotion-related features for speech audio are learned with an unsupervised triplet-loss objective, and a text-to-text transformer network is used to extract text embeddings that capture latent emotional meaning. Transfer learning provides a powerful and reusable technique for fine-tuning emotion recognition models pretrained on large audio and text datasets, respectively. The extracted emotional information from the speech audio and the text embeddings is processed by dedicated transformer networks, and multimodal fusion is implemented by a deep co-attention transformer network built with an alternating co-attention mechanism. Experimental results show that the proposed method for learning a joint emotion representation achieves good performance in multimodal emotion recognition.
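
To make the fusion step concrete, the sketch below shows one plausible reading of an alternating co-attention transformer in PyTorch: each modality's tokens attend over the other modality's tokens, the two streams are pooled, and the fused vector is classified. The abstract does not give implementation details, so the embedding dimension, number of attention heads, stack depth, number of emotion classes, and the mean-pooling head are all illustrative assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """One alternating co-attention step: each modality attends to the other."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attends_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_text = nn.LayerNorm(dim)

    def forward(self, audio, text):
        # Audio queries attend over text keys/values, and vice versa.
        a, _ = self.audio_attends_text(audio, text, text)
        t, _ = self.text_attends_audio(text, audio, audio)
        return self.norm_audio(audio + a), self.norm_text(text + t)


class CoAttentionFusion(nn.Module):
    """Stack of co-attention blocks followed by a classification head.

    Depth and class count are placeholders, not values from the paper.
    """

    def __init__(self, dim=256, depth=4, num_emotions=7):
        super().__init__()
        self.blocks = nn.ModuleList(CoAttentionBlock(dim) for _ in range(depth))
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, audio, text):
        for block in self.blocks:
            audio, text = block(audio, text)
        # Pool each modality over its sequence and classify the fused vector.
        fused = torch.cat([audio.mean(dim=1), text.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Example: a batch of 8 utterances with 50 audio frames and 30 text tokens,
# both projected to 256-d features (standing in for the triplet-loss speech
# embeddings and the text-to-text transformer embeddings described above).
audio_feats = torch.randn(8, 50, 256)
text_feats = torch.randn(8, 30, 256)
logits = CoAttentionFusion()(audio_feats, text_feats)  # shape: (8, num_emotions)
```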
