Abstract A central challenge in human-computer interaction is building systems that can hear and respond as appropriately as a human. This has motivated automatic Speech Emotion Recognition (SER) systems, which identify emotional classes by extracting and selecting effective features from speech signals. In this study, we propose a novel feature extraction method based on adaptive time-frequency coefficients to improve SER. Simulations are performed on the Berlin Emotional Speech Database (EMO-DB), the Surrey Audio-Visual Expressed Emotion database (SAVEE), and the Persian Drama Radio Emotional Corpus (PDREC). The main contribution of this work is the extraction of novel Adaptive Time-Frequency features based on the Fractional Fourier Transform and their combination with cepstral features. Experimental results show that the proposed method effectively identifies the emotional classes in the EMO-DB (97.57% accuracy), SAVEE (80% accuracy), and PDREC (91.46% accuracy) datasets.
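To make the described pipeline concrete, the following is a minimal Python sketch of the general idea only, not the authors' exact method: fractional-Fourier-domain coefficients are computed per speech frame and fused with cepstral (MFCC) features. The FrFT discretization (a fractional power of the DFT matrix), the fixed order alpha, the per-frame magnitude statistics, and the use of librosa are all illustrative assumptions.

```python
# Sketch only: FrFT-domain frame features fused with MFCCs.
# Everything below (frame length, alpha, statistics, fusion by
# concatenation) is an assumption for illustration.
import numpy as np
import librosa
from scipy.linalg import dft

def frft_matrix(n, alpha):
    """Discrete FrFT of order alpha as a fractional power of the
    unitary DFT matrix (one common discretization; the branch choice
    among degenerate eigenvalues is not unique)."""
    F = dft(n, scale="sqrtn")          # unitary DFT matrix
    w, V = np.linalg.eig(F)            # unitary => diagonalizable
    return V @ np.diag(w ** alpha) @ np.linalg.inv(V)

def fused_features(y, sr, alpha=0.5, frame_len=512, hop=256):
    """Per-frame FrFT magnitude statistics stacked with 13 MFCCs."""
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    spec = np.abs(frft_matrix(frame_len, alpha) @ frames)     # FrFT magnitudes
    tf = np.vstack([spec.mean(axis=0), spec.std(axis=0)])     # 2 x T
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=hop)
    t = min(tf.shape[1], mfcc.shape[1])
    return np.vstack([tf[:, :t], mfcc[:, :t]])                # 15 x T

# Usage on any mono signal (a bundled librosa example is used here)
y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
print(fused_features(y, sr).shape)    # (15, n_frames)
```

In the paper, the transform order is presumably adapted per signal (hence "adaptive time-frequency"); the fixed alpha above is used purely to keep the sketch short.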