Speech emotion recognition based on multi-task learning using a convolutional neural network

Jung Hyuk Lee, Nam Kyun Kim, Jiwon Lee, Hong Kook Kim, Geon Woo Lee, Hun Kyu Ha

doi:10.1109/apsipa.2017.8282123

Abstract

In this paper, we propose a speech emotion recognition (SER) method based on a multi-task learning convolutional neural network (MTL-CNN). Classifiers based on deep neural networks (DNNs) have recently been reported to outperform those based on hidden Markov models (HMMs) and support vector machines (SVMs). However, such DNN-based classifiers still suffer from generalization error because training data are limited. To mitigate this problem, the proposed method incorporates multi-task learning (MTL) as a form of transfer learning. Specifically, the proposed MTL-CNN is trained on three auxiliary tasks, the classification of arousal level, valence level, and gender, alongside the main emotion classification task. Training the main task jointly with the three auxiliary tasks helps the MTL-CNN learn useful features and the relationships between tasks. SER experiments on the Berlin database of emotional speech show that the proposed MTL-CNN achieves a relative F1-score improvement of 3.64% over a CNN trained on the single emotion recognition task.
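The joint training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the network layout, the number of arousal or valence levels, or how the task losses are combined, so everything below is an assumption made for illustration: the shared feature vector stands in for the output of the shared CNN layers, the class counts in TASKS are hypothetical, and aux_weight is a hypothetical weighting parameter, not a value from the paper.

```python
import math

# Hypothetical class counts per task (assumptions, not taken from the paper).
TASKS = {
    "emotion": 7,  # main task; seven classes assumed
    "arousal": 2,  # auxiliary task
    "valence": 2,  # auxiliary task
    "gender": 2,   # auxiliary task
}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def linear_head(features, weights, bias):
    """One task-specific output layer applied to the shared features."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def multitask_loss(task_logits, targets, aux_weight=0.5):
    """Main-task loss plus a weighted sum of the auxiliary-task losses."""
    main = cross_entropy(task_logits["emotion"], targets["emotion"])
    aux = sum(cross_entropy(task_logits[t], targets[t])
              for t in task_logits if t != "emotion")
    return main + aux_weight * aux


# Stand-in for the features produced by the shared convolutional layers.
shared = [0.2, -1.0, 0.5]

# Zero-initialised heads, one per task, all reading the same shared features.
heads = {t: ([[0.0] * len(shared) for _ in range(n)], [0.0] * n)
         for t, n in TASKS.items()}

logits = {t: linear_head(shared, w, b) for t, (w, b) in heads.items()}
targets = {"emotion": 3, "arousal": 1, "valence": 0, "gender": 1}
loss = multitask_loss(logits, targets)
```

Because every head reads the same shared features, gradients from the auxiliary losses would also update the shared layers, which is the mechanism by which the auxiliary tasks can regularise the main task. With the zero-initialised heads used here, each head predicts a uniform distribution, so the loss equals ln 7 + 0.5 · 3 · ln 2.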
