Abstract
To make Human-Computer Interaction more natural, we propose Tensor Factorized Neural Networks (TFNN) and Attention Gated Tensor Factorized Neural Networks (AG-TFNN) for the Speech Emotion Recognition (SER) task. Standard speech representations, namely 2D and 3D Mel-spectrograms and the Temporal Modulation Spectrogram, are explored to investigate how effectively the tensor-factorization-based architectures capture emotion-salient information. The hidden layers are interpreted as a deep tensor factorization based on the Tucker decomposition, but with a unified discriminative objective function so that the factor matrices are learned discriminatively. The core tensor produced in each hidden layer is the feature associated with that factorization layer. Mel-spectrograms are naturally in 2D tensor form, so TFNN and AG-TFNN are an appropriate choice over baselines such as the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM): they require fewer parameters to be learned and have a simpler architecture. Experiments on two standard emotional speech datasets, Emo-DB and IEMOCAP, show that TFNN and AG-TFNN surpass the state-of-the-art CNN+LSTM combination while using fewer parameters.
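As an illustration of the factorization layer described above, the following is a minimal sketch rather than the authors' implementation. It assumes a 2D input X of shape (F, T), two factor matrices U1 and U2, and a core tensor computed as U1^T X U2 followed by a nonlinearity. The function names, the tanh activation, the bias term, and the sizes used are all hypothetical choices made for the example. The attention gating of AG-TFNN is not shown.

```python
import numpy as np


def tf_layer(x, u1, u2, bias):
    """One hypothetical tensor-factorization layer for a 2D input.

    x:    (F, T) input, e.g. a Mel-spectrogram
    u1:   (F, r1) factor matrix for the first mode
    u2:   (T, r2) factor matrix for the second mode
    bias: (r1, r2) bias added to the core tensor (an assumption)
    Returns the (r1, r2) core tensor after a nonlinearity.
    """
    core = u1.T @ x @ u2         # Tucker-style projection along both modes
    return np.tanh(core + bias)  # tanh is an assumed activation


def param_counts(f, t, r1, r2):
    """Trainable parameters: factorized layer vs. a dense layer on the
    flattened (f * t) input producing the same r1 * r2 outputs."""
    factorized = f * r1 + t * r2 + r1 * r2
    dense = f * t * r1 * r2 + r1 * r2
    return factorized, dense


# Illustrative sizes only; none of these come from the paper.
F, T, R1, R2 = 40, 100, 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((F, T))
u1 = 0.1 * rng.standard_normal((F, R1))
u2 = 0.1 * rng.standard_normal((T, R2))
bias = np.zeros((R1, R2))

h = tf_layer(x, u1, u2, bias)
print(h.shape)                      # (8, 16)
print(param_counts(F, T, R1, R2))   # (2048, 512128)
```

With these illustrative sizes, the factorized layer has 2,048 trainable parameters, whereas a fully connected layer producing the same number of outputs from the flattened input would have 512,128. This is the kind of parameter reduction that the comparison with the baselines refers to.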