The paradigm of textual and display-based control in human-computer interaction (HCI) has shifted toward more intuitive modalities such as gesture, voice, and imitation. Speech in particular carries a large amount of information, revealing the speaker's inner state as well as his or her goals and intentions. Language analysis can recover what the speaker is asking for, but additional acoustic features of speech reveal the speaker's mood, purpose, and intention. Consequently, emotion recognition from speech has become crucial in modern HCI systems. At the same time, it is challenging to aggregate the judgments of the many experts involved in emotion identification. Several methods for analyzing sound have been proposed in the past, but analyzing people's emotions during live speech remained impractical. With the advancement of artificial intelligence and the strong performance of deep learning techniques, studies on real-time data are now more prominent than ever. This research applies a state-of-the-art deep learning technique to identify emotions in human speech. The study uses the open-source Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which contains more than 2000 recordings of speech and song produced by 24 actors, each covering eight distinct emotions, and is designed to support emotion classification. A novel neuro-fuzzy swallow-swarm-optimized deep convolutional neural network (NFSO-DCNN) approach is proposed for classification. The performance of the proposed model was compared with that of related work, and the outcomes were assessed. Applying the proposed model to the RAVDESS dataset yielded an overall accuracy of 98.5% in classifying emotions.
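The abstract does not include implementation details of the NFSO-DCNN model. The sketch below is only an illustrative baseline, not the proposed method: it shows how RAVDESS-style clips might be loaded, converted to MFCC features, and classified with a plain convolutional network. The directory path, hyperparameters, and helper names (load_clip, EmotionCNN) are assumptions introduced for illustration.

```python
# Illustrative baseline only: load RAVDESS-style .wav files, extract MFCC
# features, and train a small CNN. This is NOT the paper's NFSO-DCNN model.
import glob
import os

import librosa
import numpy as np
import torch
import torch.nn as nn


def load_clip(path, sr=22050, n_mfcc=40, max_frames=200):
    # RAVDESS filenames encode the emotion label in the third dash-separated
    # field, e.g. "03-01-05-01-02-01-12.wav" -> emotion code 5.
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate to a fixed number of frames so clips can be batched.
    mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)
    label = int(os.path.basename(path).split("-")[2]) - 1  # 8 emotions -> 0..7
    return mfcc, label


class EmotionCNN(nn.Module):
    """Small 2-D CNN over MFCC 'images' of shape (1, n_mfcc, frames)."""

    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


if __name__ == "__main__":
    # Assumed local copy of the dataset; adjust the path as needed.
    files = glob.glob("RAVDESS/**/*.wav", recursive=True)
    data = [load_clip(f) for f in files]
    X = torch.tensor(np.stack([m for m, _ in data]), dtype=torch.float32).unsqueeze(1)
    y = torch.tensor([lbl for _, lbl in data])

    model = EmotionCNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(10):  # toy full-batch loop; no train/test split shown
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        print(f"epoch {epoch}: loss {loss.item():.3f}")
```

A real evaluation would add a held-out test split and per-emotion metrics; the reported 98.5% accuracy refers to the authors' NFSO-DCNN model, not to this baseline.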