Abstract

Speech emotion recognition (SER) is a crucial research area in artificial intelligence and human–computer interaction, and extracting effective speech features remains a continuing focus of SER research. Most prior work has sought an optimal speech feature that captures hidden local characteristics while ignoring the global relationships within the speech signal. In this paper, we propose a method that uses a multiscale, multichannel feature extraction structure to combine global and local information into comprehensive speech features. Our approach employs a one-dimensional convolutional neural network (1D CNN) for feature learning and emotion recognition; by capturing both the spectral and spatial characteristics of speech, it achieves stronger learning capability and improved SER results. We conducted extensive experiments on publicly available emotion recognition datasets, employing three distinct data augmentation (DA) techniques to improve model generalization. Our model was trained on Mel-frequency cepstral coefficient (MFCC) and zero-crossing rate (ZCR) features extracted from the speech samples and outperformed state-of-the-art techniques in accuracy. Additional experiments validate the effectiveness and reliability of the proposed method.
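
As a rough illustration of the pipeline the abstract describes, the sketch below extracts MFCC and ZCR features with librosa and defines a small 1D CNN in Keras. It is a minimal sketch under stated assumptions: the layer sizes, the specific augmentation variants (noise injection, pitch shifting, time stretching stand in for the three unnamed DA techniques), and the class count are illustrative choices, not the authors' reported configuration.

    import numpy as np
    import librosa
    import tensorflow as tf

    def extract_features(path, sr=16000, n_mfcc=40):
        """Load an utterance and return a fixed-length MFCC + ZCR feature vector."""
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
        zcr = librosa.feature.zero_crossing_rate(y)             # shape (1, frames)
        # Average over time to obtain one (n_mfcc + 1)-dimensional vector per utterance.
        return np.concatenate([mfcc.mean(axis=1), zcr.mean(axis=1)])

    def augment(y, sr):
        """Three common DA variants (assumed here; the paper does not name its three)."""
        noisy = y + 0.005 * np.random.randn(len(y))                   # noise injection
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)    # pitch shifting
        stretched = librosa.effects.time_stretch(y, rate=0.9)         # time stretching
        return [noisy, shifted, stretched]

    # Illustrative 1D CNN over the stacked feature vector, treated as a 1D sequence.
    n_features, n_classes = 41, 8  # n_classes depends on the chosen dataset
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(64, 5, padding='same', activation='relu',
                               input_shape=(n_features, 1)),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(128, 5, padding='same', activation='relu'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

In this sketch, each utterance is reduced to a single 41-dimensional vector (40 mean MFCCs plus mean ZCR) and reshaped to (41, 1) before being fed to the network; frame-level sequences with a time-axis convolution would be an equally plausible reading of the abstract.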
