Abstract

Speech Emotion Recognition (SER) refers to the use of machines to recognize the emotions of a speaker from their speech. SER has broad application prospects in fields such as criminal investigation and medical care. However, the complexity of emotion makes it difficult to recognize, and current SER models still fail to recognize human emotions accurately. In this paper, we propose an attention method based on multi-head self-attention to improve the recognition accuracy of SER. We call this method head fusion. With this method, an attention layer can generate attention maps with multiple attention points instead of the common attention maps with a single attention point. We implemented an attention-based convolutional neural network (ACNN) model with this method and conducted experiments and evaluations on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, obtaining 76.18% weighted accuracy (WA) and 76.36% unweighted accuracy (UA) on the improvised data, an improvement of about 6% over the previous state-of-the-art SER model.
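To make the head-fusion idea concrete, the following is a minimal PyTorch sketch of a self-attention layer that fuses its per-head attention maps into a single map carrying multiple attention points. The fusion step (a simple mean over heads) and all names such as HeadFusionAttention are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a "head fusion" self-attention layer. Assumption:
# the per-head attention maps are fused by averaging into one map with
# multiple attention points. Names here are hypothetical, not the
# authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadFusionAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)  # joint Q/K/V projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq_len, head_dim)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # Per-head attention maps: (batch, heads, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        maps = F.softmax(scores, dim=-1)
        # Head fusion: merge per-head maps into a single attention map
        # that keeps several attention points (mean over the head axis).
        fused = maps.mean(dim=1, keepdim=True)  # (batch, 1, seq, seq)
        # The fused map is broadcast across all heads' value projections.
        ctx = (fused @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)

# Usage: attend over a batch of 2 spectrogram-frame sequences of length 100.
layer = HeadFusionAttention(dim=64, num_heads=4)
y = layer(torch.randn(2, 100, 64))  # -> shape (2, 100, 64)
```

Averaging the softmax maps keeps each row a valid attention distribution while letting peaks from different heads coexist in one map; other fusion operators (e.g., a sum followed by renormalization) would serve the same illustrative purpose.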
