Abstract

This study aims to learn deep features from two different data representations to recognise speech emotion. The authors designed a merged convolutional neural network (CNN) with two branches, a one-dimensional (1D) CNN branch and a two-dimensional (2D) CNN branch, to learn high-level features from raw audio clips and log-mel spectrograms. Building the merged deep CNN involves two steps. First, a 1D CNN architecture and a 2D CNN architecture were designed and evaluated; then, after removing their second dense layers, the two architectures were merged. To speed up the training of the merged CNN, transfer learning was introduced: the 1D CNN and the 2D CNN were trained first, and the features they learned were repurposed and transferred to the merged CNN. Finally, the merged deep CNN, initialised with the transferred features, was fine-tuned. Two hyperparameters of the designed architectures were chosen through Bayesian optimisation during training. Experiments conducted on two benchmark datasets show that the merged deep CNN significantly improves emotion classification performance.
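To make the two-step construction concrete, below is a minimal sketch of such a merged two-branch CNN in Keras. The abstract does not give layer counts, filter sizes, input shapes, or the number of emotion classes, so every concrete number and layer name here (16000-sample clips, 128x128 spectrograms, 4 classes) is an illustrative assumption rather than the authors' exact architecture.

    # Minimal sketch of a merged 1D/2D CNN; all sizes and names are assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # 1D branch: high-level features from raw audio clips.
    audio_in = layers.Input(shape=(16000, 1), name="raw_audio")
    a = layers.Conv1D(64, 9, activation="relu", name="a_conv1")(audio_in)
    a = layers.MaxPooling1D(4, name="a_pool1")(a)
    a = layers.Conv1D(128, 9, activation="relu", name="a_conv2")(a)
    a = layers.GlobalMaxPooling1D(name="a_gap")(a)
    # First dense layer is kept; each branch's second dense layer is deleted
    # before merging, as the abstract describes.
    a = layers.Dense(256, activation="relu", name="a_dense1")(a)

    # 2D branch: high-level features from log-mel spectrograms.
    spec_in = layers.Input(shape=(128, 128, 1), name="log_mel")
    s = layers.Conv2D(64, (3, 3), activation="relu", name="s_conv1")(spec_in)
    s = layers.MaxPooling2D((2, 2), name="s_pool1")(s)
    s = layers.Conv2D(128, (3, 3), activation="relu", name="s_conv2")(s)
    s = layers.GlobalMaxPooling2D(name="s_gap")(s)
    s = layers.Dense(256, activation="relu", name="s_dense1")(s)

    # Merge the two branches and classify.
    merged = layers.Concatenate(name="merge")([a, s])
    out = layers.Dense(4, activation="softmax", name="emotion")(merged)
    merged_cnn = Model(inputs=[audio_in, spec_in], outputs=out)

    # Transfer step: repurpose weights from the separately trained branch
    # models (cnn_1d and cnn_2d, assumed to use the same layer names as
    # above), then fine-tune the whole merged network end to end.
    def transfer_weights(src_model, dst_model):
        """Copy weights for every layer name the two models share."""
        for layer in src_model.layers:
            if layer.get_weights():
                try:
                    dst_model.get_layer(layer.name).set_weights(layer.get_weights())
                except ValueError:
                    pass  # layer absent from merged model, e.g. the deleted second dense layer

    # transfer_weights(cnn_1d, merged_cnn)  # cnn_1d/cnn_2d: pre-trained branch models
    # transfer_weights(cnn_2d, merged_cnn)
    merged_cnn.compile(optimizer="adam", loss="categorical_crossentropy",
                       metrics=["accuracy"])

Deleting each branch's second dense layer before merging means the concatenation operates on comparable fixed-length feature vectors, and the name-based weight copy mirrors the abstract's transfer scheme: train the branches alone, transfer their learned features into the merged network, then fine-tune.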
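The abstract also states that two hyperparameters were chosen through Bayesian optimisation but does not say which two. The sketch below uses KerasTuner's Bayesian search over two stand-in hyperparameters, the dense-layer width and the learning rate; make_merged_cnn is a hypothetical builder wrapping the architecture above.

    # Hedged sketch of Bayesian hyperparameter search with KerasTuner.
    import keras_tuner as kt
    import tensorflow as tf

    def build_model(hp):
        # The two tuned hyperparameters are assumptions; the paper does not name them.
        units = hp.Int("dense_units", min_value=128, max_value=512, step=128)
        lr = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
        model = make_merged_cnn(dense_units=units)  # hypothetical builder for the merged CNN
        model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    tuner = kt.BayesianOptimization(build_model, objective="val_accuracy", max_trials=20)
    # tuner.search([x_audio, x_spec], y_train, validation_split=0.2, epochs=30)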
