Abstract

Speech emotion recognition (SER) is a challenging task with many unsolved problems, such as extracting representative features and dealing with imbalanced training data. Although much research has been done in this area over the past few decades, performance is still far from satisfactory. In this paper, we propose an Ensemble System that fuses four different subsystems. The TDNN (Time Delay Neural Network) System uses a neural network with p-norm and time delay as the classifier. The i-vector/SVM (Support Vector Machine) System learns acoustic features in the i-vector space. The Simple Late Fusion System fuses different features at the decision level, while the Balanced Late Fusion System introduces a data rebalance module to rebalance the class distribution of the training samples. The overall Ensemble System takes advantage of each subsystem at the decision level. Experiments are conducted on the CHEAVD 2.0 database provided for the Multimodal Emotion Recognition Challenge. On the test set, the Simple Late Fusion System outperforms the baseline system by 3.9% and 6.9% in Accuracy (ACC) and Macro Average Precision (MAP), respectively. Our results indicate that the Simple Late Fusion System is more effective in terms of ACC and MAP, while the Balanced Late Fusion System outperforms the other systems on Macro Average Recall and Macro Average F1.
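The two ideas central to the abstract, decision-level late fusion of subsystem outputs and rebalancing the class distribution of the training data, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the equal-weight averaging of class posteriors, the random-oversampling rebalance strategy, and the assumed array shapes (one `n_utterances × n_classes` posterior matrix per subsystem) are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def late_fuse(posteriors, weights=None):
    """Decision-level fusion: weighted average of per-subsystem
    class-posterior matrices, each of shape (n_utterances, n_classes)."""
    stacked = np.stack(posteriors)              # (n_systems, n_utts, n_classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    weights = np.asarray(weights).reshape(-1, 1, 1)
    fused = (weights * stacked).sum(axis=0)     # (n_utts, n_classes)
    return fused.argmax(axis=1)                 # predicted class indices

def rebalance(features, labels, rng=None):
    """Naive class rebalancing: randomly oversample minority classes
    until every class matches the majority-class count."""
    if rng is None:
        rng = np.random.default_rng(0)
    counts = Counter(labels)
    target = max(counts.values())
    idx = []
    for cls in counts:
        cls_idx = np.flatnonzero(labels == cls)
        extra = rng.choice(cls_idx, size=target - len(cls_idx), replace=True)
        idx.extend(cls_idx)
        idx.extend(extra)
    idx = np.asarray(idx)
    return features[idx], labels[idx]

# Hypothetical usage: fuse posteriors from two subsystems with equal weights.
# preds = late_fuse([tdnn_posteriors, ivector_svm_posteriors])
```

The paper's actual fusion rule and rebalance module may differ (e.g., learned fusion weights or a different resampling scheme); the sketch only conveys the overall structure of fusing at the decision level and equalizing class counts before training.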
