Abstract

In sophisticated Human-Computer Interfaces (HCI), the user's emotional state is becoming a crucial component, and it is closely linked to emotional speech recognition. Spoken expressions, which form part of human-machine interaction, are an important source of emotional information. Speech emotion recognition (SER) with deep learning (DL) remains a highly active research topic, especially in affective computing, owing to its expanding potential, advances in algorithms, and practical applications. The paralinguistic information contained in human speech can be modeled with quantitative factors such as pitch, intensity, accent, and Mel-Frequency Cepstral Coefficients (MFCC). SER is usually achieved through three key procedures: data processing, feature selection/extraction, and classification based on the underlying emotional qualities. The nature of these procedures and the peculiarities of human speech support the use of DL techniques for SER implementation. A variety of DL methods have been applied to SER tasks in recent affective computing research; however, only a small number of works capture the underlying ideas and methodologies that can facilitate the three main steps of SER implementation. With a focus on these three processes, this work provides a state-of-the-art review of research from the last ten years that has tackled SER tasks from a DL perspective. Several issues are covered in detail, including the low classification accuracy of speaker-independent experiments and the related remedies. The review also offers principles for SER evaluation, emphasizing metrics that can be experimented with and common baselines.
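To make the three SER steps named above concrete, the following is a minimal, illustrative sketch of a conventional (non-DL) pipeline: loading and resampling audio (data processing), computing utterance-level MFCC statistics (feature extraction), and training a simple classifier on emotion labels (classification). The file paths and labels are hypothetical placeholders, and the use of librosa and scikit-learn is an assumption for illustration, not a method surveyed in the paper.

```python
# Illustrative SER pipeline sketch: data processing -> feature extraction -> classification.
# Paths and labels below are placeholders for an emotional speech corpus (assumption).
import numpy as np
import librosa
from sklearn.svm import SVC


def extract_features(path, sr=16000, n_mfcc=13):
    """Load one utterance and return utterance-level MFCC statistics."""
    y, sr = librosa.load(path, sr=sr)                         # data processing: load and resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # feature extraction: MFCC frames
    # Summarize frame-level features into a fixed-length utterance vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


# Hypothetical corpus: replace with real utterance paths and emotion labels.
files = ["utt_001.wav", "utt_002.wav"]
labels = ["happy", "angry"]

X = np.stack([extract_features(f) for f in files])

# Classification based on the underlying emotional qualities;
# an SVM stands in here for the DL classifiers discussed in the survey.
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X))
```

In DL-based SER, the hand-crafted feature and SVM stages above are typically replaced by learned representations (e.g., CNN or recurrent layers over spectrogram or MFCC inputs), but the three-step structure of the pipeline remains the same.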
