Abstract

The performance of speech recognition systems trained on neutral utterances degrades significantly when these systems are tested with emotional speech. Since anyone may speak emotionally in real-world environments, the emotional state of the speaker must be taken into account in the performance of an automatic speech recognition (ASR) system. Limited work has been done in the field of emotion-affected speech recognition (EASR), and so far most research has focused on the classification of speech emotions. In this paper, the vocal tract length normalization (VTLN) method is employed to enhance the robustness of the EASR system. For this purpose, two structures of the speech recognition system are used, based on hybrids of the hidden Markov model (HMM) with either a Gaussian mixture model (GMM) or a deep neural network (DNN). To achieve this goal, frequency warping is applied in the filterbank and/or discrete cosine transform (DCT) domain(s) of the feature extraction process of the ASR system. The warping is performed so as to normalize the emotional feature components and bring them close to their corresponding neutral feature components. The performance of the proposed system is evaluated under neutrally trained/emotionally tested conditions for different speech features and emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad). In this system, frequency warping is employed for different acoustical features. The EASR system is built on the Kaldi ASR toolkit, with the Persian emotional speech database and the crowd-sourced emotional multi-modal actors dataset (CREMA-D) as input corpora. The experimental results reveal that, in general, the warped emotional features yield better EASR performance than their unwarped counterparts. Moreover, the DNN-HMM speech recognizer outperforms the GMM-HMM hybrid.
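As a rough illustration of the filterbank-domain warping described in the abstract, the minimal sketch below applies a piecewise-linear frequency warp, one common VTLN variant, to the center frequencies of a mel filterbank. The warping function, cutoff ratio, warp factor, and filterbank size here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to the mel scale (HTK convention)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warp_freq(f, alpha, f_max=8000.0, f_cut_ratio=0.85):
    """Piecewise-linear VTLN warp of a frequency axis (illustrative).

    Frequencies below the cutoff are scaled by the warp factor alpha;
    above it, the warp is continued linearly so that f_max maps to f_max.
    """
    f_cut = f_cut_ratio * f_max
    return np.where(
        f <= f_cut,
        alpha * f,
        alpha * f_cut + (f_max - alpha * f_cut) / (f_max - f_cut) * (f - f_cut),
    )

# Warp the band edges/centers of a 23-band mel filterbank with an
# assumed warp factor alpha = 0.92.
n_filters = 23
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), n_filters + 2)
centers_hz = mel_to_hz(mel_points)
warped_centers = warp_freq(centers_hz, alpha=0.92)
```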

Highlights

  • Speech is the natural medium of communication for humans

  • The results show that adding supplementary features such as pitch and formant frequencies to the feature vector is useful in improving emotional speech recognition

  • To examine the effectiveness of frequency warping for the mel-frequency cepstral coefficient (MFCC), modified mel-scale cepstral coefficient (M-MFCC), exponential logarithmic scale (ExpoLog), gammatone filterbank cepstral coefficient (GFCC), and power normalized cepstral coefficient (PNCC) features, the performances of these features and their corresponding warped features are evaluated in the Kaldi baseline automatic speech recognition (ASR) system for different emotional states (a sketch of such a warped feature pipeline follows this list)
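As referenced in the last highlight, the sketch below shows one way warped filterbank energies could feed a cepstral feature pipeline: triangular filters are built on the warped center frequencies from the previous sketch, and a DCT produces cepstral coefficients in the spirit of MFCC extraction. The helper names and parameter values are assumptions for illustration, not Kaldi's actual implementation.

```python
import numpy as np
from scipy.fftpack import dct

def triangular_fbank(edge_hz, n_fft=512, sr=16000):
    """Build triangular filters from band-edge frequencies in Hz.

    edge_hz holds n_filters + 2 points (lower edge, centers, upper edge),
    e.g. the warped_centers array from the warping sketch above.
    """
    bins = np.floor((n_fft + 1) * edge_hz / sr).astype(int)
    n_filters = len(edge_hz) - 2
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        # Rising and falling slopes of the i-th triangular filter.
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fbank

def warped_cepstra(power_spec, fbank, n_ceps=13):
    """Log filterbank energies followed by a DCT -> cepstral coefficients.

    power_spec: (n_frames, n_fft // 2 + 1) power spectrogram.
    """
    energies = np.maximum(power_spec @ fbank.T, 1e-10)  # floor to avoid log(0)
    return dct(np.log(energies), type=2, axis=-1, norm='ortho')[..., :n_ceps]
```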


Summary

Introduction

Speech is the natural medium of communication for humans. In recent years, improvements in speech technology have led to a considerable enhancement in human-computer interaction. For a person speaking emotionally, the anatomical configuration of the vocal tract changes compared to that of a person speaking neutrally. This fact implies that compensating for emotion-related variability in speech via vocal tract length normalization (VTLN) could increase recognizer performance under emotional conditions. Here, VTLN is applied to acoustical features other than MFCCs in order to develop more robust features for improving the performance of the emotion-affected speech recognition (EASR) system. Another aspect of the present work concerns the use of a deep neural network (DNN) in the structure of the speech recognizer. The simulation results presented include examining the effect of applying cepstral mean normalization (CMN) in the feature extraction process, investigating the influence of different ranges of frequency warping, and evaluating the performance of various frequency warping methods for the GMM-HMM/DNN-HMM EASR system (a sketch of warp-factor selection follows below).
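The abstract states that warping is chosen so that emotional feature components move close to their neutral counterparts. A hedged sketch of one way such a warp factor could be selected is shown below; the grid of warp factors, the mean Euclidean distance criterion, and the extract_features interface are all assumptions for illustration, not the paper's published procedure.

```python
import numpy as np

def select_warp_factor(emotional_utts, neutral_mean, extract_features,
                       alphas=np.arange(0.88, 1.13, 0.02)):
    """Pick the warp factor whose warped emotional features lie closest
    (in mean Euclidean distance) to the neutral feature statistics.

    extract_features(utt, alpha) is a hypothetical helper standing in for
    the warped feature pipeline sketched earlier; it should return a
    (n_frames, n_ceps) array for one utterance at a given warp factor.
    """
    best_alpha, best_dist = 1.0, np.inf
    for alpha in alphas:
        feats = np.vstack([extract_features(u, alpha) for u in emotional_utts])
        # CMN could optionally be applied here before the comparison,
        # e.g. feats -= feats.mean(axis=0), mirroring the CMN experiments.
        dist = np.linalg.norm(feats.mean(axis=0) - neutral_mean)
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha
```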

Methods
Methods of VTLN
Experiments and evaluations
Results and discussions
Conclusion