Speech emotion recognition (SER) is the task of recognizing a speaker's emotional state from a speech utterance. Earlier studies have proposed a wide variety of cepstral features for building SER systems. Mel-frequency cepstral coefficients (MFCC) and human-factor cepstral coefficients (HFCC) are two widely used variants. MFCC and HFCC features are extracted from speech signals using mel and human-factor filter banks, respectively. The magnitude response of each filter in these banks is triangular, so the banks are referred to as triangular filter banks (TFB), and the corresponding cepstral coefficients can be denoted TFBCC-M (for MFCC) and TFBCC-HF (for HFCC). The mel filter bank (TFB-M) is constructed using the mel scale, while the human-factor filter bank (TFB-HF) is constructed using the human-factor scale, which combines the mel and equivalent rectangular bandwidth (ERB) scales. In the same way, other frequency scales can be used to realize other TFBs and extract other types of TFBCC features. Along these lines, this paper proposes two new TFBs, denoted TFB-B and TFB-E, realized using the Bark and ERB scales, to extract new cepstral features referred to as TFBCC-B and TFBCC-E, respectively. The mathematical background for constructing the proposed TFB-B and TFB-E is presented. The proposed filter banks are used alongside the conventional TFB-M and TFB-HF to extract four different types of TFBCC features. These features are extracted from the emotional speech signals of two databases: the Berlin database of emotional speech (Emo-DB) and the Surrey audio-visual expressed emotion speech database (SAVEE). The extracted features are used to develop speaker-dependent (SD) and speaker-independent (SI) SER systems using support vector machines. The performance of each feature type is analyzed both in isolation and in combination.
The experimental results show that the cepstral features extracted using the proposed TFBs characterize and recognize emotions comparably to the conventional MFCC and HFCC features. Moreover, combining different cepstral features improves the overall recognition performance of the SER systems. For the Emo-DB database, the proposed TFBCC-B and TFBCC-E features used in isolation achieve recognition accuracies of 83.23% and 81.99% in the SD scenario, and 75% and 60.94% in the SI scenario, respectively. Similarly, for the SAVEE database, they achieve recognition accuracies of 75% and 66.67% in the SD scenario, and 44.17% and 55% in the SI scenario. For the Emo-DB database, maximum recognition accuracies of 86.96% (for several combinations of conventional and proposed features, namely TFBCC-{(M+E), (M+B+E), (HF+B+E), (M+HF+B+E)}) and 77.08% (for the combination TFBCC-(M+B+E)) are achieved in the SD and SI scenarios, respectively. Similarly, for the SAVEE database, maximum recognition accuracies of 77.08% (for the combination TFBCC-(M+HF+E)) and 55.83% (for the combination TFBCC-(B+E)) are achieved in the SD and SI scenarios, respectively.
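The idea of realizing triangular filter banks on different frequency scales can be illustrated with a minimal sketch. It is not the paper's construction: the abstract does not give the scale formulas, so the commonly used mel formula, Traunmüller's Bark approximation, and the Glasberg and Moore ERB-rate formula are assumed here, and the human-factor bank is not shown. The function names, the number of filters, the FFT size, and the sampling rate are likewise illustrative assumptions.

```python
import numpy as np

# Assumed (Hz -> scale, scale -> Hz) conversion pairs; the paper may use others.
SCALES = {
    "mel": (lambda f: 2595.0 * np.log10(1.0 + f / 700.0),
            lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)),
    "bark": (lambda f: 26.81 * f / (1960.0 + f) - 0.53,
             lambda z: 1960.0 * (z + 0.53) / (26.28 - z)),
    "erb": (lambda f: 21.4 * np.log10(1.0 + 0.00437 * f),
            lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437),
}


def triangular_filter_bank(scale, n_filters=26, n_fft=512, sample_rate=16000):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters
    whose edges are equally spaced on the chosen frequency scale."""
    to_scale, to_hz = SCALES[scale]
    # n_filters + 2 edge points: each filter spans three consecutive points.
    edges_hz = to_hz(np.linspace(to_scale(0.0), to_scale(sample_rate / 2.0),
                                 n_filters + 2))
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, ctr, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        # Triangle = lower of the two slopes, clipped to the range [0, 1].
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, 1.0)
    return bank


def cepstral_coefficients(frame, bank, n_ceps=13):
    """Log filter-bank energies of one frame followed by an unnormalized DCT-II."""
    n_fft = 2 * (bank.shape[1] - 1)
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
    log_energy = np.log(bank @ power + 1e-10)  # small offset avoids log(0)
    n = bank.shape[0]
    k = np.arange(n_ceps)[:, None]
    dct_basis = np.cos(np.pi * k * (np.arange(n)[None, :] + 0.5) / n)
    return dct_basis @ log_energy


# Usage: Bark-scale coefficients for one synthetic 25 ms frame at 16 kHz.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400) * np.hamming(400)
tfbcc_b = cepstral_coefficients(frame, triangular_filter_bank("bark"))
```

Only the scale conversion pair changes between the three banks; filter placement and the cepstral step are shared, which is what makes swapping in another frequency scale straightforward.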