In this paper, a hybrid system consisting of three stages, namely feature extraction, dimensionality reduction, and feature classification, is proposed for speech emotion recognition (SER). At the feature extraction stage, an information-rich spectral-prosodic hybrid feature vector is extracted for each frame, comprising the perceptual-spectral features of mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLPC), and perceptual minimum variance distortionless response (PMVDR) coefficients, together with the prosodic feature of pitch (F0). This feature vector is extracted from both the speech signal and its glottal waveform. The first- and second-order derivatives are then appended to form a high-dimensional hybrid feature vector. At the next stage, the dimensionality of this feature vector is reduced using a newly proposed approach based on quantum-behaved particle swarm optimization (QPSO). Specifically, a new QPSO algorithm (termed pQPSO) is presented that uses a truncated Laplace distribution (TLD) to generate new particles, so that all solutions (i.e. particles) remain within the valid range of the problem, unlike the standard QPSO. The contraction-expansion (CE) factor of the proposed pQPSO is also selected adaptively. Using this algorithm, an optimal discriminative dimensionality reduction matrix (i.e. projection matrix) is estimated, with emotion classification accuracy serving as the class-discriminative criterion. At the final stage, the reduced-dimensionality feature vectors are fed into a Gaussian elliptical basis function (GEBF) neural network classifier to recognize the speech emotion. To accelerate the training phase of the GEBF classifier, a fast scaled conjugate gradient (SCG) algorithm is employed, which does not require tuning of a learning rate. Finally, the proposed method is evaluated on three standard emotional speech databases: the Berlin Database of Emotional Speech (EMODB), the Surrey Audio-Visual Expressed Emotion (SAVEE) database, and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. The experimental results show that the proposed method recognizes speech emotions more accurately than state-of-the-art methods.
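The abstract describes the core novelty of pQPSO only at a high level: new particle positions are drawn from a truncated Laplace distribution so that candidate solutions never leave the feasible search range, and the contraction-expansion factor is adapted during the run. The sketch below illustrates one plausible form of such an update step in Python; the function names, the inverse-CDF truncation, the linear CE schedule, and the use of classification accuracy as the fitness function are illustrative assumptions, since the abstract does not give the paper's exact formulation.

```python
import numpy as np

def sample_truncated_laplace(mu, b, lo, hi, rng):
    """Draw one sample from a Laplace(mu, b) distribution truncated to [lo, hi]
    using inverse-CDF sampling, so the result is guaranteed to stay in bounds."""
    def cdf(x):
        z = (x - mu) / b
        return 0.5 * np.exp(z) if z < 0.0 else 1.0 - 0.5 * np.exp(-z)

    def inv_cdf(u):
        if u < 0.5:
            return mu + b * np.log(2.0 * u)
        return mu - b * np.log(2.0 * (1.0 - u))

    # restrict the uniform draw to the CDF mass lying inside [lo, hi]
    u = rng.uniform(cdf(lo), cdf(hi))
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # avoid log(0) at the extremes
    return inv_cdf(u)


def pqpso_step(X, pbest, gbest, lo, hi, it, max_it, fitness, rng,
               beta_max=1.0, beta_min=0.5):
    """One QPSO-style iteration in which new positions are drawn from a
    truncated Laplace distribution centred on the local attractor, with a
    contraction-expansion (CE) factor that shrinks over the iterations."""
    n_particles, dim = X.shape
    beta = beta_max - (beta_max - beta_min) * it / max_it  # assumed linear CE schedule
    mbest = pbest.mean(axis=0)                             # mean of the personal-best positions
    X_new = np.empty_like(X)
    for i in range(n_particles):
        phi = rng.uniform(size=dim)
        attractor = phi * pbest[i] + (1.0 - phi) * gbest   # per-dimension local attractor
        scale = beta * np.abs(mbest - X[i]) + 1e-12        # per-dimension Laplace scale
        for j in range(dim):
            X_new[i, j] = sample_truncated_laplace(attractor[j], scale[j],
                                                   lo[j], hi[j], rng)
        # keep the new position as the personal best only if it improves the fitness
        if fitness(X_new[i]) > fitness(pbest[i]):
            pbest[i] = X_new[i]
    gbest = pbest[np.argmax([fitness(p) for p in pbest])].copy()
    return X_new, pbest, gbest
```

Sampling through the inverse CDF of the Laplace distribution restricted to [lo, hi] keeps every particle in range without the clipping or rejection steps a standard QPSO update would need. In the setting of the paper, X would hold flattened candidate projection matrices and fitness would return the emotion classification accuracy obtained with the corresponding reduced features; both of these are assumptions made here for illustration.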