Abstract
This paper proposes instant modulation spectral features for efficient emotion recognition, derived by estimating the intrinsic characteristics of speech modulations. The feature representation is based on the single frequency filtering (SFF) technique and a higher-order nonlinear energy operator. The speech signal is decomposed into frequency sub-bands using SFF, and the associated nonlinear energies are estimated with the higher-order nonlinear energy operator. The feature vector is then obtained using cepstral analysis. The high-resolution property of SFF is exploited to extract the amplitude envelope of the speech signal at a selected frequency with good time-frequency resolution, while the fourth-order nonlinear energy operator provides noise robustness in estimating the modulation components. The proposed feature set is tested on the emotion recognition task using the i-vector model with the probabilistic linear discriminant scoring scheme, support vector machine, and random forest classifiers. The results demonstrate that this feature representation outperforms the widely used spectral and prosody features, achieving detection accuracies of 85.75%, 59.88%, and 65.78% on three emotional databases, EMODB, FAU-AIBO, and IEMOCAP, respectively. Further, the proposed features are found to be robust in the presence of additive white Gaussian and vehicular noises.
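The pipeline described in the abstract (SFF decomposition, nonlinear energy estimation, cepstral analysis) can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm. The SFF step uses the commonly published form: the frequency of interest is shifted to half the sampling rate and the result is passed through a single-pole filter with its pole at z = -r. The energy operator uses one common discrete form of the k-th order operator, x[n]x[n+k-2] - x[n-1]x[n+k-1], which reduces to the Teager-Kaiser operator for k = 2. The function names, the value r = 0.995, the choice of band frequencies, and the utterance-level averaging in `modulation_cepstra` are all hypothetical and may differ from the authors' configuration.

```python
import numpy as np


def sff_envelope(x, fs, freq_hz, r=0.995):
    """Amplitude envelope of x at freq_hz via single frequency filtering.

    Assumed form: shift freq_hz to fs/2, then filter with a single pole
    at z = -r. The value of r is an illustrative choice.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    # Shift the frequency of interest to normalized frequency pi (fs/2).
    omega_shift = np.pi - 2.0 * np.pi * freq_hz / fs
    shifted = x * np.exp(1j * omega_shift * n)
    # Single-pole recursion: y[n] = -r * y[n-1] + shifted[n].
    y = np.zeros(len(x), dtype=complex)
    prev = 0.0 + 0.0j
    for i, s in enumerate(shifted):
        prev = -r * prev + s
        y[i] = prev
    return np.abs(y)


def energy_operator(x, k=4):
    """k-th order discrete energy operator (assumed form).

    psi[n] = x[n] * x[n+k-2] - x[n-1] * x[n+k-1]; k = 2 gives Teager-Kaiser.
    """
    x = np.asarray(x, dtype=float)
    n_out = len(x) - k
    if k < 2 or n_out < 1:
        raise ValueError("need k >= 2 and len(x) > k")
    return (x[1:n_out + 1] * x[k - 1:k - 1 + n_out]
            - x[:n_out] * x[k:k + n_out])


def dct2(v):
    """Unnormalized DCT-II, written out explicitly to avoid extra dependencies."""
    v = np.asarray(v, dtype=float)
    size = len(v)
    j = np.arange(size)
    m = j[:, None]
    return np.cos(np.pi * m * (j + 0.5) / size) @ v


def modulation_cepstra(x, fs, band_freqs_hz, n_ceps=13, k=4, r=0.995):
    """Hypothetical utterance-level feature: one log energy per SFF band,
    followed by a DCT. A real system would likely work frame by frame."""
    log_e = []
    for f in band_freqs_hz:
        env = sff_envelope(x, fs, f, r)
        psi = energy_operator(env, k)
        # The operator output can be negative, so average its magnitude.
        log_e.append(np.log(np.mean(np.abs(psi)) + 1e-12))
    return dct2(log_e)[:n_ceps]
```

For a pure sinusoid of amplitude A and normalized frequency Ω, this form of the operator returns the constant A² sin(Ω) sin((k-1)Ω), which gives a quick sanity check on the implementation.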