Abstract
Emotion recognition is a widely studied topic in speech technology. Emotions conveyed in speech carry information that is useful for many purposes. The main components of speech emotion recognition are the speech features, the speech corpus, and the machine learning algorithm used as the classifier. In this paper, a cross-corpus method is used for Indonesian Speech Emotion Recognition (SER), together with a combination of Mel Frequency Cepstral Coefficients (MFCC) and Teager Energy features. With a Support Vector Machine (SVM) as the classifier, the experimental results show that applying the cross-corpus method, that is, adding corpora from other languages to the training dataset, improves emotion classification accuracy by 4.16% with the MFCC Statistics feature and by 2.09% with the Teager-MFCC Statistics feature.
Highlights
Information technology (IT) is growing rapidly, especially in the area of mobile devices
We achieved accuracies of 83.33% and 79.17% using the Mel Frequency Cepstral Coefficients (MFCC) Statistics feature in the first and second scenarios, respectively, and accuracies of 85.42% and 83.33% using the Teager-MFCC Statistics feature in the same two scenarios
Applying the cross-corpus method, by adding corpora from other languages to the training dataset, can improve the overall performance of emotion recognition, including Indonesian Speech Emotion Recognition (SER)
Summary
Information technology (IT) is growing rapidly, especially in the area of mobile devices. One simple application is a virtual assistant that compiles a comforting (song) playlist for the user when sadness is recognized in the user's speech. Because of this high potential for use, it is necessary to analyze the emotion recognition process itself further. The first main topic of this study is the use of the cross-corpus method [7] for Indonesian SER. There are three corpora: one German corpus and two English corpora. The other main topic is the combination of two kinds of speech features, Mel Frequency Cepstral Coefficients (MFCC) features and Teager Energy features. The MFCC features are combined with Teager Energy features [9] in the hope of achieving better results. These speech features are extracted from the corpus and used along with their statistical values.
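To make the feature side more concrete, the following is a minimal sketch of two building blocks mentioned above: a Teager energy computation and the reduction of per-frame features to statistical values. It uses the standard discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The function names `teager_energy` and `frame_statistics` are hypothetical, and the choice of mean and standard deviation as the statistics is an assumption; the summary does not say which statistics are used or exactly how the Teager Energy and MFCC features are combined.

```python
import math
import statistics


def teager_energy(x):
    """Discrete Teager-Kaiser energy operator.

    psi[n] = x[n]^2 - x[n-1] * x[n+1], defined for interior samples only,
    so the output is two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def frame_statistics(frames):
    """Collapse per-frame feature vectors into one utterance-level vector.

    Assumption: mean and population standard deviation per coefficient.
    `frames` is a list of equal-length feature vectors, one per frame.
    """
    columns = list(zip(*frames))  # one tuple of values per coefficient
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means + stds


# Usage: for a pure sinusoid A*cos(w*n), the operator yields the constant
# A^2 * sin(w)^2, which is why it is read as an "energy" measure.
signal = [2 * math.cos(0.3 * n) for n in range(50)]
energy = teager_energy(signal)

# Toy "frames" with two coefficients each, pooled into [means..., stds...].
pooled = frame_statistics([[1, 2], [3, 4]])
```

In a full pipeline, the pooled vector for each utterance would be the input to the classifier; the frame-level features here are toy values, not real MFCCs.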