Speech emotion recognition (SER) has recently attracted considerable attention as a fundamental component of emotional human-machine interaction. Deep convolutional neural networks (DCNNs) have enabled significant advances, particularly in learning high-level features for SER. However, a primary challenge in applying deep neural networks to SER is overfitting, which commonly occurs when a large number of parameters is trained on insufficiently large datasets. Most research addressing this problem converts the audio input into an image and applies transfer learning. In this work, the speech samples are first reconstructed in a three-dimensional phase space; prior research indicates that the patterns formed in this space carry significant emotional characteristics of the speaker. To apply Deep-CNNs to the field of SER, a new representation of speech signals, called the Chaogram, is proposed: it projects these phase-space patterns onto a two-dimensional image with three channels, analogous to an RGB image, so that the result matches the input expected by VGG-based Deep-CNNs. In the next stage, image enhancement techniques are used to accentuate the finer details of the Chaogram images. A Visual Geometry Group (VGG) DCNN, pre-trained on the large-scale ImageNet dataset and used with intelligent sensors, is then employed to learn high-level Chaogram features and emotion classes, and transfer learning is applied on the target datasets to further fine-tune the proposed model. In addition, to optimize the hyper-parameter configuration of the architecture-determined CNN, a Deep-CNN combined with Beluga Whale Optimization (BWO) is introduced. Results on two publicly available emotion datasets, EMO-DB and eNTERFACE05, demonstrate the capability of the proposed approach and its potential to significantly improve SER applications. The performance analysis suggests that image enhancement methods emphasizing such features can lead to more precise classification, and that adapting these images to the inputs of pre-trained Deep-CNNs can considerably improve classification accuracy.
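The abstract does not specify the exact Chaogram construction, so the following Python sketch only illustrates the general idea under stated assumptions: a one-dimensional speech frame is embedded into a three-dimensional phase space via time-delay embedding, and the three pairwise projections of the trajectory are accumulated into a three-channel, VGG-sized image. The delay, image size, and channel assignment are illustrative choices, not the paper's parameters.

```python
# Hypothetical sketch of a Chaogram-style representation (assumed construction):
# time-delay phase-space reconstruction projected onto a 3-channel image.
import numpy as np

def delay_embed(signal, delay=8, dim=3):
    """Reconstruct a 3-D phase space from a 1-D speech frame via time-delay embedding."""
    n = len(signal) - (dim - 1) * delay
    return np.stack([signal[i * delay: i * delay + n] for i in range(dim)], axis=1)

def chaogram_like_image(signal, size=224, delay=8):
    """Project each pair of phase-space axes onto one image channel (RGB-like)."""
    pts = delay_embed(signal, delay=delay, dim=3)
    # Normalise trajectory points into pixel coordinates.
    pts = (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-12)
    idx = np.clip((pts * (size - 1)).astype(int), 0, size - 1)
    img = np.zeros((size, size, 3), dtype=np.float32)
    for c, (a, b) in enumerate([(0, 1), (1, 2), (0, 2)]):
        # Accumulate visit counts of the trajectory in each 2-D projection.
        np.add.at(img[:, :, c], (idx[:, a], idx[:, b]), 1.0)
    img /= img.max() + 1e-12  # scale to [0, 1] like an RGB image
    return img

# Example: a synthetic frame standing in for a short speech segment.
frame = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.05 * np.random.randn(4000)
image = chaogram_like_image(frame)
print(image.shape)  # (224, 224, 3), compatible with VGG-style inputs
```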
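A minimal sketch of the transfer-learning step is shown below, assuming a VGG-16 backbone pre-trained on ImageNet whose classifier head is replaced for the emotion classes of the target corpus (seven classes in EMO-DB). The layer-freezing choice and the learning rate are placeholders, not the paper's settings; in the described approach such hyper-parameters would be tuned by BWO.

```python
# Hedged sketch: fine-tuning an ImageNet-pre-trained VGG-16 on Chaogram images.
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # EMO-DB defines seven emotion categories

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the convolutional feature extractor; only the new head is trained here.
for p in model.features.parameters():
    p.requires_grad = False

# Replace the final fully connected layer with an emotion classifier.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_EMOTIONS)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # placeholder; the paper tunes such hyper-parameters with BWO
)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of Chaogram images (N, 3, 224, 224).
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, NUM_EMOTIONS, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```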