Emotion recognition by social robots is a serious challenge, since even humans sometimes struggle with it. It is important to draw on information about emotions from all available sources: facial expressions, speech, and physiological responses. A multimodal emotion recognition system was therefore introduced that combines these sources of information with deep learning algorithms. A key component of this system is the speech analysis module, which was divided into two tracks: acoustic speech and transcribed text. An additional constraint is the target language of communication, Polish, for which available datasets and methods are very limited. The work shows that emotion recognition from a single source, text or speech alone, can yield low accuracy. English and Polish datasets and state-of-the-art deep learning methods for speech emotion recognition based on Mel spectrograms were therefore compared. The most accurate LSTM models were evaluated on the English dataset and on the Polish nEMO dataset, demonstrating high emotion recognition accuracy on the Polish data. The conducted research is a key element in the development of a decision-making algorithm that fuses the outputs of several emotion recognition modules in a multimodal system.
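The abstract does not give implementation details, so the following is only a minimal sketch of the kind of pipeline it describes: log-Mel spectrogram features fed to an LSTM classifier. The library choices (librosa, PyTorch) and all hyperparameters (sample rate, number of Mel bands, hidden size, six-class label set) are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch: log-Mel spectrogram -> LSTM emotion classifier.
# All settings below are assumptions for illustration only.
import librosa
import numpy as np
import torch
import torch.nn as nn

N_MELS = 64        # assumed number of Mel bands
NUM_EMOTIONS = 6   # assumed size of the emotion label set


def log_mel_spectrogram(path: str, sr: int = 16000,
                        n_mels: int = N_MELS) -> np.ndarray:
    """Load an audio file and return a log-Mel spectrogram, shape (frames, n_mels)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # frames become the LSTM time axis


class EmotionLSTM(nn.Module):
    """LSTM over spectrogram frames; last hidden state maps to emotion logits."""

    def __init__(self, n_mels: int = N_MELS, hidden: int = 128,
                 num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mels)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # logits over emotion classes


# Usage with a hypothetical recording:
# feats = torch.from_numpy(log_mel_spectrogram("sample.wav")).unsqueeze(0).float()
# probs = EmotionLSTM()(feats).softmax(dim=-1)
```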