STERM: A Multimodal Speech Emotion Recognition Model in Filipino Gaming Settings

Giorgio Armani G Magno, Jheanel E Estrada, Lhuijee Jhulo V Cuchapin

doi:10.1109/hnicem57413.2022.10109472

Abstract

Gaming is closely connected to emotion. Unfortunately, most game experience research has little or no connection to the emotion literature, which leaves emotion in games poorly understood. As technology and the understanding of emotion progress, the researchers take the opportunity to investigate the recognition of underlying emotions while playing Valorant, a currently popular online game. Recognizing these emotions requires a speech emotion recognition model. For emotion recognition in human speech, one can either extract emotion-related attributes from the speech data or transcribe the speech into its text equivalent and then analyze it using natural language processing. Emotion detection benefits from an audio-textual multimodal set-up, but devising a system that learns from both modalities is not straightforward. One option is to construct independent models for the two input sources and aggregate their outputs at the decision level. Inspired by this idea, the researchers propose a speech emotion recognition model that uses two modalities: speech and text. This study aims to assess how a natural speech database consisting of in-game audio communications of Filipino gamers performs in multimodal emotion recognition, and to detect uttered profane words using audio and textual features. The model employs deep learning algorithms such as Convolutional Neural Networks (CNN) for speech, and Natural Language Processing for recognizing emotions from text and detecting profane words. The results of each modality were evaluated using statistical measures and then combined, and they show that the proposed approach achieves state-of-the-art performance on the natural speech database.
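
The decision-level aggregation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the weighted average of class probabilities used here is an assumption, as are the label set EMOTIONS, the function names fuse_decisions and predict_emotion, and the speech_weight parameter. It is not the authors' implementation.

```python
from typing import List, Sequence

# Hypothetical label set; the abstract does not list the emotion classes.
EMOTIONS = ("angry", "happy", "neutral", "sad")


def fuse_decisions(
    speech_probs: Sequence[float],
    text_probs: Sequence[float],
    speech_weight: float = 0.5,
) -> List[float]:
    """Combine per-class probabilities from two independent models.

    Assumed fusion rule: a weighted average of the two probability
    vectors, renormalized so the result sums to 1.
    """
    if len(speech_probs) != len(text_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")

    fused = [
        speech_weight * s + (1.0 - speech_weight) * t
        for s, t in zip(speech_probs, text_probs)
    ]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [p / total for p in fused]


def predict_emotion(
    speech_probs: Sequence[float],
    text_probs: Sequence[float],
    speech_weight: float = 0.5,
) -> str:
    """Return the label with the highest fused probability."""
    fused = fuse_decisions(speech_probs, text_probs, speech_weight)
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best_index]


if __name__ == "__main__":
    # Illustrative scores only, not results from the paper.
    speech_scores = [0.6, 0.2, 0.1, 0.1]  # e.g. output of a speech CNN
    text_scores = [0.1, 0.7, 0.1, 0.1]    # e.g. output of a text classifier
    print(predict_emotion(speech_scores, text_scores))
```

Because the two models stay independent, either one can be retrained or replaced without touching the other; only the fusion weight ties them together.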

Full Text