Abstract
Emotion recognition by social robots is a serious challenge because sometimes people also do not cope with it. It is important to use information about emotions from all possible sources: facial expression, speech, or reactions occurring in the body. Therefore, a multimodal emotion recognition system was introduced, which includes the indicated sources of information and deep learning algorithms for emotion recognition. An important part of this system includes the speech analysis module, which was decided to be divided into two tracks: speech and text. An additional condition is the target language of communication, Polish, for which the number of datasets and methods is very limited. The work shows that emotion recognition using a single source—text or speech—can lead to low accuracy of the recognized emotion. It was therefore decided to compare English and Polish datasets and the latest deep learning methods in speech emotion recognition using Mel spectrograms. The most accurate LSTM models were evaluated on the English set and the Polish nEMO set, demonstrating high efficiency of emotion recognition in the case of Polish data. The conducted research is a key element in the development of a decision-making algorithm for several emotion recognition modules in a multimodal system.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.