Robots must be able to recognize human emotions in order to engage with people and plan their actions autonomously. Nonverbal cues such as pitch, loudness, spectral content, and speaking rate convey emotion effectively to most listeners. Given this, a machine can infer emotions from the acoustic characteristics of speech, which carry important information about the speaker's emotional state. Likewise, combinations of facial action units can be used to describe a person's emotion. In this paper, we propose a deep Convolutional Neural Network (CNN)-based system that recognizes emotions in real time with high accuracy. The study presents a new CNN-based speech emotion recognition system. Using a high-end GPU, a model is built and fed raw speech from a selected dataset for training, classification, and testing. We further analyze the speech data and fuse information from the visual and audio sources to improve the recognition system's accuracy. The experimental results demonstrate the benefits of the proposed method for emotion identification and the effect of combining visual and auditory cues. In the present work, the convolutional neural networks are trained on grayscale images and predict emotion classes through a softmax output. To obtain the best accuracy, we experimented with different baselines and max-pooling configurations, ultimately reaching 89.98% accuracy. Dropout is one technique we use to guard against overfitting.
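Below is a minimal, hypothetical sketch of the kind of architecture the abstract describes: grayscale image input, stacked convolution and max-pooling blocks, dropout for regularization, and a softmax output over emotion classes. The input size, number of classes, filter counts, and training settings are illustrative assumptions, not values reported in the paper.

```python
# Hypothetical CNN sketch matching the abstract's description; all hyperparameters are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7            # assumed number of emotion categories
INPUT_SHAPE = (48, 48, 1)  # assumed grayscale image size

def build_emotion_cnn():
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        # Convolution + max-pooling blocks extract increasingly abstract features.
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        # Dropout randomly disables units during training to reduce overfitting.
        layers.Dropout(0.5),
        # Softmax layer produces a probability for each emotion class.
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_emotion_cnn().summary()
```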