Abstract

In this paper, we propose a speech emotion recognition (SER) method based on a multi-task learning convolutional neural network (MTL-CNN). Classifiers using deep neural networks (DNNs) have recently been reported to outperform those based on the hidden Markov model (HMM) and the support vector machine (SVM). However, such DNN-based classifiers still suffer from generalization error when training data are limited. To mitigate this problem, the proposed method incorporates multi-task learning (MTL) as a form of transfer learning. Specifically, the proposed MTL-CNN performs classification of arousal level, valence level, and gender as three auxiliary tasks alongside the main emotion classification task. Training the main task jointly with these auxiliary tasks helps the MTL-CNN learn useful features and the relationships between tasks. SER experiments on the Berlin database of emotional speech show that an SER system using the proposed MTL-CNN achieves a relative F1-score improvement of 3.64% over a CNN trained on the single emotion recognition task.
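As an illustration of how a main task and three auxiliary tasks can share one training objective, the following minimal sketch combines per-task cross-entropy losses into a single value. It is not the authors' implementation: the abstract does not specify the loss weighting, the class counts, or the network layout, so the `aux_weight` parameter, the task names used as dictionary keys, and the function names are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss for one example given the true class index."""
    return -math.log(softmax(logits)[target])

# Hypothetical task names; the abstract names the tasks but not their encoding.
MAIN_TASK = "emotion"
AUX_TASKS = ("arousal", "valence", "gender")

def multitask_loss(logits_by_task, targets_by_task, aux_weight=0.5):
    """Main-task loss plus a weighted sum of auxiliary-task losses.

    aux_weight is an assumed hyperparameter, not a value from the paper.
    """
    loss = cross_entropy(logits_by_task[MAIN_TASK], targets_by_task[MAIN_TASK])
    for task in AUX_TASKS:
        loss += aux_weight * cross_entropy(logits_by_task[task],
                                           targets_by_task[task])
    return loss
```

In a shared-trunk network, each entry of `logits_by_task` would come from a separate output head on top of common convolutional features, so gradients from the auxiliary losses also shape the features used for emotion classification.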
