Abstract

Speech cues can be used to identify human emotions with deep learning models for speech emotion recognition, trained with supervised or unsupervised machine learning on speech emotion databases and evaluated on held-out test data. Despite many advantages, existing approaches still suffer from limited accuracy and other shortcomings. To mitigate these issues, we propose a new feature-specific hybrid framework for speech emotion recognition that combines deep learning architectures, namely a recurrent neural network and a convolutional neural network, and analyses different characteristics to better describe speech emotion. The framework first extracts features by applying a bag-of-audio-words model to Mel-frequency cepstral coefficient (MFCC) characteristics, producing a codebook of acoustic words that encode emotion features; these are fed into the hybrid deep learning architecture to achieve high classification and prediction accuracy. The outputs of the two networks are concatenated and passed to a softmax layer, which produces a categorical classification for speech emotion recognition. The proposed model is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio dataset, which comprises eight emotion classes. Experimental results on this dataset show that the proposed framework outperforms state-of-the-art approaches, achieving an 89.5% recognition rate and 98% accuracy.
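To make the described architecture concrete, the sketch below (our illustration, not the authors' implementation) shows one way a CNN branch and an RNN branch over a sequence of frame-level acoustic features (e.g., MFCC or bag-of-audio-words descriptors) could be combined, concatenated, and fed to a softmax layer over the eight RAVDESS emotion classes. All input dimensions and layer sizes are illustrative assumptions.

```python
# Minimal sketch of a hybrid CNN + RNN classifier for speech emotion recognition,
# assuming fixed-length feature sequences of shape (time_steps, n_features) and
# 8 emotion classes (as in RAVDESS). Layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS, N_FEATURES, N_CLASSES = 200, 40, 8  # assumed dimensions

inputs = layers.Input(shape=(TIME_STEPS, N_FEATURES))

# CNN branch: 1-D convolutions over the time axis capture local spectral patterns.
cnn = layers.Conv1D(64, kernel_size=5, activation="relu")(inputs)
cnn = layers.MaxPooling1D(pool_size=2)(cnn)
cnn = layers.Conv1D(128, kernel_size=5, activation="relu")(cnn)
cnn = layers.GlobalAveragePooling1D()(cnn)

# RNN branch: an LSTM summarizes longer-range temporal dynamics.
rnn = layers.LSTM(128)(inputs)

# Concatenate both branch outputs and classify with a softmax layer.
merged = layers.concatenate([cnn, rnn])
merged = layers.Dense(64, activation="relu")(merged)
outputs = layers.Dense(N_CLASSES, activation="softmax")(merged)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```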
