Abstract

Speech emotion recognition (SER) is a challenging task with many unsolved problems, such as extracting representative features and dealing with imbalanced training data. Although much research has been done in this area over the past few decades, performance is still far from satisfactory. In this paper, we propose an Ensemble System that fuses four different subsystems. The TDNN (Time Delay Neural Network) System uses a neural network with p-norm and time delay as the classifier. The i-vector/SVM (Support Vector Machine) System learns acoustic features in the i-vector space. The Simple Late Fusion System fuses different features at the decision level, while the Balanced Late Fusion System introduces a data rebalance module to rebalance the class distribution of the training samples. The overall Ensemble System takes advantage of each subsystem at the decision level. Experiments are conducted on the CHEAVD 2.0 database provided for the Multimodal Emotion Recognition Challenge. On the test set, the Simple Late Fusion System outperforms the baseline system by 3.9% and 6.9% in Accuracy (ACC) and Macro Average Precision (MAP), respectively. Our results indicate that the Simple Late Fusion System is more effective in terms of ACC and MAP, while the Balanced Late Fusion System outperforms the other systems on Macro Average Recall and Macro Average F1.
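The two ideas central to the abstract, decision-level late fusion of subsystem outputs and rebalancing the class distribution of the training data, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the equal-weight averaging of class posteriors, the random-oversampling rebalance strategy, and the assumed array shapes (one `n_utterances × n_classes` posterior matrix per subsystem) are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def late_fuse(posteriors, weights=None):
    """Decision-level fusion: weighted average of per-subsystem
    class-posterior matrices, each of shape (n_utterances, n_classes)."""
    stacked = np.stack(posteriors)              # (n_systems, n_utts, n_classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    weights = np.asarray(weights).reshape(-1, 1, 1)
    fused = (weights * stacked).sum(axis=0)     # (n_utts, n_classes)
    return fused.argmax(axis=1)                 # predicted class indices

def rebalance(features, labels, rng=None):
    """Naive class rebalancing: randomly oversample minority classes
    until every class matches the majority-class count."""
    if rng is None:
        rng = np.random.default_rng(0)
    counts = Counter(labels)
    target = max(counts.values())
    idx = []
    for cls in counts:
        cls_idx = np.flatnonzero(labels == cls)
        extra = rng.choice(cls_idx, size=target - len(cls_idx), replace=True)
        idx.extend(cls_idx)
        idx.extend(extra)
    idx = np.asarray(idx)
    return features[idx], labels[idx]

# Hypothetical usage: fuse posteriors from two subsystems with equal weights.
# preds = late_fuse([tdnn_posteriors, ivector_svm_posteriors])
```

The paper's actual fusion rule and rebalance module may differ (e.g., learned fusion weights or a different resampling scheme); the sketch only conveys the overall structure of fusing at the decision level and equalizing class counts before training.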
