Abstract

Automatic perception and understanding of human emotion is an increasingly attractive research field in artificial intelligence and human-computer interaction. Emotion portrayal within conversation plays a significant role in the semantics of a sentence. However, emotion is not only biologically determined but also influenced by the environment. Cultural differences therefore exist in some aspects of emotion, and the next generation of computer systems must adapt to these cross-cultural differences to enable more naturalistic interactions between humans and machines. In this paper, we investigate the suitability of state-of-the-art deep learning architectures based on recurrent neural network (RNN) variants with explicit attention modelling to bridge the gap across different cultures (German and Hungarian) for emotion prediction in video. Three attention-based network architectures are proposed in this work: early attention fusion, extended multi-attention fusion, and attention-based encoder-decoder. Our approach achieves very promising Concordance Correlation Coefficient (CCC) results that outperform the baseline on all three dimensions: Arousal (0.637 vs. 0.614), Valence (0.689 vs. 0.615), and Liking (0.625 vs. 0.222).
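For readers unfamiliar with the reported metric, the following is a minimal sketch of the Concordance Correlation Coefficient in its standard form (Lin's definition, with population variances). It is not the evaluation code used in this work, and the function name `concordance_cc` and its arguments are hypothetical names chosen for illustration.

```python
from statistics import fmean


def concordance_cc(predictions, gold):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Standard definition (an assumption about the exact variant used here):
        CCC = 2 * cov(p, g) / (var(p) + var(g) + (mean(p) - mean(g))**2)
    with population (divide-by-n) variance and covariance.
    """
    if len(predictions) != len(gold) or len(gold) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(gold)
    mean_p = fmean(predictions)
    mean_g = fmean(gold)

    # Population variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predictions, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical; the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom
```

Unlike the Pearson correlation, this metric penalises a constant offset between predictions and annotations: `concordance_cc([1, 2, 3], [2, 3, 4])` is well below 1 even though the two sequences are perfectly correlated, whereas identical sequences score exactly 1.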
