Abstract

Every human being expresses emotion toward the things that concern them. For a customer, that emotion can help a customer representative understand their requirements, so speech emotion recognition plays an important role in human interaction. Intelligent systems can improve this process, for which we design a convolutional neural network (CNN) based model that classifies emotions into categories such as positive, negative, or more specific labels. In this paper, we use audio recordings from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The log-mel spectrogram and mel-frequency cepstral coefficients (MFCCs) were used to extract features from the raw audio files. These features have previously been used to classify emotions with techniques such as long short-term memory (LSTM) networks, CNNs, hidden Markov models (HMMs), and deep neural networks (DNNs). For this paper, we have divided the emotions into three sections for male and female speakers. In the first section, we divide the emotions into two classes, positive and negative. In the second section, we divide them into three classes: positive, negative, and neutral. In the third section, we divide them into eight classes: happy, sad, angry, fearful, surprised, disgust, calm, and neutral. For these three sections, we propose a model that consists of eight consecutive 2D convolutional layers. The proposed model outperforms previously reported models, so the emotion of the consumer can now be identified more reliably.
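The three label groupings described in the abstract can be sketched as a simple mapping over the eight RAVDESS emotion classes. The paper does not specify which emotions count as positive versus negative, so the split below is an illustrative assumption, not the authors' grouping:

```python
# The eight RAVDESS emotion labels.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

# Illustrative assumption: which labels are negative vs. positive.
NEGATIVE = {"sad", "angry", "fearful", "disgust"}
POSITIVE = {"happy", "calm", "surprised"}

def to_three_class(label):
    """Collapse an 8-class RAVDESS label to positive/negative/neutral."""
    if label in POSITIVE:
        return "positive"
    if label in NEGATIVE:
        return "negative"
    return "neutral"

def to_two_class(label):
    """Binary split; 'neutral' is treated as positive here (an assumption)."""
    return "negative" if label in NEGATIVE else "positive"
```

A classifier trained on the 8-class labels can then be evaluated at any of the three granularities by mapping its predictions through these functions.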

Highlights

  • Speech is the direct way to transfer information from one end to the other. It carries a wide variety of information and can express rich emotional content in response to objects, scenes, or events. The automatic recognition of emotions by analyzing the human voice and facial expressions has become an active subject of study. The following systems can be cited as examples of the areas in which these studies are applied: (i) Education: a distance-education course system can detect bored users so that it can change the style or level of the material provided and, in addition, offer emotional incentives or compromises

  • Some variables, such as pulse, blood pressure, facial expressions, body movements, brain waves, and acoustic properties, vary depending on the emotional state

  • While such physiological changes cannot be detected without a portable medical device, facial expressions and voice signals can be received directly without connecting any device to the person


Introduction

Journal of Healthcare Engineering

Speech is the direct way to transfer information from one end to the other. It carries a wide variety of information and can express rich emotional content in response to objects, scenes, or events. The automatic recognition of emotions by analyzing the human voice and facial expressions has become an active subject of study. The following systems can be cited as examples of the areas in which these studies are applied:

(i) Education: a distance-education course system can detect bored users so that it can change the style or level of the material provided and, in addition, offer emotional incentives or compromises.
(ii) Automobile: driving performance and the emotional state of the driver are closely linked; therefore, these systems can be used to enrich the driving experience and improve driving performance.
(iii) Security: they can be used as support systems in public spaces by detecting extreme feelings such as fear and anxiety.
(iv) Communication: in call centers, integrating an automatic emotion recognition system with the interactive voice response system can help improve customer service.
(v) Health: it can benefit people with autism, who can use portable devices to understand their feelings and emotions and possibly adjust their social behavior [1].

It is known that some physiological changes occur in the body due to a person's emotional state. Such changes cannot be detected without a portable medical device, whereas facial expressions and voice signals can be received directly without connecting any device to the person. For this reason, most studies on this topic have focused on the automatic recognition of emotions using visual and auditory signals. Acoustic signals are the most used data after facial signs to identify a person's emotional state [2].
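As a concrete illustration of the acoustic features discussed above, a log-mel spectrogram and MFCCs can be computed from a raw waveform with plain NumPy. The frame length, hop size, and filter counts below are hypothetical defaults chosen for the sketch, not the settings used in the paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, FFT -> power spectrogram.
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)             # shape: (frames, n_mels)

def mfcc(log_mel, n_mfcc=13):
    # DCT-II over the mel axis yields the cepstral coefficients.
    n_mels = log_mel.shape[1]
    k = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc),
                                    (2 * k + 1) / (2.0 * n_mels)))
    return log_mel @ basis.T               # shape: (frames, n_mfcc)
```

A 2D array like the log-mel output is what a stack of 2D convolutional layers, such as the eight-layer model proposed in the abstract, would consume as an image-like input.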

