Abstract

Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important areas for future research. Our survey bridges a gap in the literature, since existing surveys focus either on SER with hand-engineered features or on representation learning in general settings without a focus on SER.

Highlights

  • Speech is a natural mode of communication among humans

  • As data is often imbalanced across classes in naturalistic emotion corpora, accuracy is usually reported as the so-called unweighted accuracy (UA) or unweighted average recall (UAR), i.e., the average recall across classes, unweighted by the number of instances per class (a worked sketch follows this list)

  • Studies are clustered into five major groups depending on the deep learning (DL) techniques employed for representation learning, beginning with supervised representation learning
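To make the UA/UAR metric above concrete, the following minimal sketch computes it with scikit-learn's recall_score using macro averaging; the labels and predictions are hypothetical, not data from any surveyed corpus:

    from sklearn.metrics import recall_score

    # Hypothetical predictions on an imbalanced three-class emotion task,
    # where "neutral" dominates, as is common in naturalistic corpora.
    y_true = ["neutral"] * 8 + ["angry", "angry", "sad", "sad"]
    y_pred = ["neutral"] * 8 + ["neutral", "angry", "sad", "neutral"]

    # UAR: recall computed per class, then averaged with equal weight per
    # class, regardless of how many instances each class has.
    uar = recall_score(y_true, y_pred, average="macro")  # (1.0 + 0.5 + 0.5) / 3 ≈ 0.667

    # Plain accuracy, by contrast, is inflated by the majority class.
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # ≈ 0.833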


Summary

Introduction

Speech is a natural mode of communication among humans. It conveys affective information about emotional expression through explicit (linguistic) and implicit (paralinguistic) cues. The paralinguistic content of speech, in particular, provides an immense body of acoustic features that can be used to encode the emotional state of the speaker. These acoustic features are reliable indicators of basic emotions and have been explored by different machine learning (ML) [2]–[4] as well as deep learning (DL) models [5]–[8] for speech emotion recognition (SER). They are designed/engineered to (a) index affective physiological changes in voice production, and (b) achieve automatic extractability [19]. In the classical Mel-frequency cepstral coefficient (MFCC) pipeline, the last step (a discrete cosine transform of the LogMel spectrum) loses information and destroys spatial relations; it is therefore usually omitted, which results in the LogMel spectrum, the most popular feature for training DL networks in the speech domain (see the sketch below).
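To illustrate this relationship, the sketch below derives a LogMel spectrogram and shows that MFCCs are just one DCT step further; it is a minimal illustration using the librosa library under assumed parameters (16 kHz audio, 25 ms windows, 40 Mel bands), and the file name is a placeholder, not part of any surveyed pipeline:

    import librosa

    # Load a placeholder utterance at a typical SER sampling rate.
    y, sr = librosa.load("utterance.wav", sr=16000)

    # Mel filterbank energies per frame, then log compression: the LogMel
    # spectrum. Stopping here preserves the time-frequency layout that
    # convolutional models can exploit.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=40)
    log_mel = librosa.power_to_db(mel)                 # shape: (40, n_frames)

    # The MFCC pipeline adds one last step: a discrete cosine transform
    # (DCT) over the Mel axis, which decorrelates the features but loses
    # information and destroys spatial relations.
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)  # shape: (13, n_frames)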
