Abstract

The predominant communication channel for conveying relevant, high-impact information is the emotion embedded in our communication. Researchers have tried to exploit these emotions in recent years for human-robot interaction (HRI) and human-computer interaction (HCI). Emotion recognition through speech alone or through facial expression alone is termed single-mode emotion recognition. The accuracy of these single-mode systems is improved by the proposed bimodal method, which combines the speech and face modalities and recognizes emotions using a Convolutional Neural Network (CNN) model. In this paper, the proposed bimodal emotion recognition system contains three major parts: audio processing, video processing, and data fusion for detecting a person's emotion. Fusing the visual information and audio data obtained from two different channels enhances the emotion recognition rate by providing complementary data. The proposed method aims to classify 7 basic emotions (anger, disgust, fear, happy, neutral, sad, surprise) from an input video. We take the audio and image frames from the video input to predict the final emotion of a person. The dataset used is RAVDESS, an audio-visual dataset uniquely suited to the study of multi-modal emotion expression and perception; it contains audio-visual, visual-only, and audio-only recordings. For bimodal emotion detection, the audio-visual portion of the dataset is used.

Highlights

  • We present an evaluation and comparison of the other experimented models, such as Random Forest, Decision Tree, and Convolutional Neural Network (CNN) for speech, and VGG and CNN for face. (Manuscript received on April 12, 2021.)

  • The kernel size used in this experimentation is 3x3 and the Rectified Linear Unit (ReLU) is used as the activation function. – Batch Normalization: normalizes the inputs given to the layer to a scale of 0 to 1, so that the values are not scattered over a wide range. – MaxPooling2D: in the model built, this function uses a pooling window of size 2x2 with 2x2 strides to perform the pooling operation on the data. – Softmax: this function normalizes K real numbers taken from the input vector into a probability distribution consisting of K probabilities.
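The softmax step described in the highlight above can be sketched in NumPy. This is a minimal illustration of the mathematical operation, not the paper's implementation (which applies softmax as the final CNN layer):

```python
import numpy as np

def softmax(logits):
    """Normalize K real numbers into a probability distribution of K probabilities."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Illustrative class scores for 3 of the 7 emotions
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs sums to 1 and preserves the ordering of the input scores
```

In the model described here, the K probabilities would correspond to the 7 emotion classes, and the predicted emotion is the class with the highest probability.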

  • This paper presented a bimodal emotion recognition system that uses information from the audio and visual channels obtained from a video stream.


Summary

INTRODUCTION

Emotions are a universal, language-independent means of communication that are expressed non-verbally. The speech emotion recognition system is based on a CNN model [15]; it recognizes emotion using feature extraction. A. Koduru et al. [12] proposed a speech emotion recognition model that extracts features, selects the required region of interest, and classifies the emotion. The main focus of that work was to use different feature extraction algorithms to improve the speech emotion recognition rate. Zhou et al. [13] used both spectral and prosodic features for recognizing emotion from speech input. Both spectral and prosodic features contain emotion information, and combining them improves the performance of the emotion recognition system. Their method uses short-time log frequency power coefficients (LFPC) to represent the speech signals and a discrete hidden Markov model (HMM) for classification of emotions. Results suggest that the average accuracy of emotion classification is 78%.
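As a rough illustration of the short-time log frequency power coefficients mentioned above, the NumPy sketch below frames a signal, computes its short-time power spectrum, and takes the log power in frequency bands. The frame length, hop size, band count, and linear band split are illustrative assumptions, not the exact LFPC formulation used in [13]:

```python
import numpy as np

def lfpc_like(signal, frame_len=400, hop=160, n_bands=12):
    """Short-time log power in frequency bands (simplified LFPC-style features)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2       # per-frame power spectrum
        bands = np.array_split(power, n_bands)        # linear band split (illustrative)
        feats.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(feats)                            # shape: (n_frames, n_bands)

rng = np.random.default_rng(0)
features = lfpc_like(rng.standard_normal(16000))      # 1 s of noise at 16 kHz
```

The resulting per-frame feature vectors are the kind of observation sequence that a discrete HMM classifier would be trained on, one model per emotion class.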

DATASET
Image and Audio extraction from Video
Facial Emotion Recognition
Speech Emotion Recognition
Bimodal Integration using Fusion Rule
RESULTS AND DISCUSSIONS
CONCLUSION

