Abstract

The performance of automatic speech recognition systems degrades in the presence of emotional states and in adverse environments (e.g., noisy conditions). This greatly limits the deployment of speech recognition applications in realistic environments. Previous studies in the field of emotion-affected speech recognition focus on improving emotional speech recognition using clean speech data recorded in a quiet environment (i.e., controlled studio settings). The goal of this research is to increase the robustness of speech recognition systems for emotional speech in noisy conditions. The proposed binaural emotional speech recognition system is based on the analysis of the binaural input signal and an estimated emotional auditory mask corresponding to the recognized emotion. Whereas the binaural signal analyzer has the task of segregating speech from noise and constructing a speech mask in a noisy environment, the estimated emotional mask identifies and removes the most emotionally affected spectro-temporal regions of the segregated target speech. In other words, the proposed system combines the two estimated masks (a binary noise mask and an emotion-specific mask) to decrease the word error rate for noisy emotional speech. The performance of the proposed binaural system is evaluated in clean-neutral train/noisy-emotional test scenarios for different noise types, signal-to-noise ratios, and spatial configurations of sources. Speech utterances from the Persian emotional speech database are used for the experiments. Simulation results show that the proposed system outperforms baseline automatic speech recognition systems trained on neutral utterances.
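To make the mask-combination idea concrete, here is a minimal sketch assuming both masks are binary matrices on the same time-frequency grid as the spectrogram: a unit survives only if the binaural analysis marks it speech-dominated and the emotional mask does not flag it as emotion-affected. All names and shapes below are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def combine_masks(noise_mask: np.ndarray, emotion_mask: np.ndarray) -> np.ndarray:
    """Element-wise AND of two binary masks: keep a time-frequency unit only
    if it is speech-dominated (noise_mask == 1) AND not strongly
    emotion-affected (emotion_mask == 1)."""
    return noise_mask * emotion_mask

def apply_mask(spectrogram: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the discarded time-frequency units before feature extraction."""
    return spectrogram * mask

# Toy example: 4 frequency channels x 5 time frames of random "energy" values.
rng = np.random.default_rng(0)
spec = rng.random((4, 5))                                 # toy magnitude spectrogram
noise_mask = (rng.random((4, 5)) > 0.3).astype(float)     # 1 = speech-dominated unit
emotion_mask = (rng.random((4, 5)) > 0.2).astype(float)   # 1 = not emotion-affected
cleaned = apply_mask(spec, combine_masks(noise_mask, emotion_mask))
print(cleaned)
```

The element-wise product is one natural way to realize "combining the two masks"; the paper's actual combination rule may differ in detail.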

Highlights

  • Speech is the most convenient means of communication for humans

  • This paper proposes a binaural emotional speech recognition (BESR) system based on known principles of computational auditory scene analysis (CASA)

  • The recognition result for the neutral case is obtained through a distinct experiment in which 80% of the total neutral utterances are used to train the acoustic model, and the rest is reserved for testing (a minimal split sketch follows below)
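A minimal sketch of the 80/20 neutral train/test split mentioned in the last highlight; the file list and seed are hypothetical, and the paper does not specify how the split was drawn.

```python
import random

# Hypothetical list of neutral utterance files (placeholder names).
utterances = [f"neutral_utt_{i:03d}.wav" for i in range(400)]

random.seed(42)          # assumed fixed seed for reproducibility
random.shuffle(utterances)

split = int(0.8 * len(utterances))
train_set, test_set = utterances[:split], utterances[split:]
print(len(train_set), len(test_set))  # 320 80
```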


Introduction

Scientific and technical advances in speech technology have led to more natural human-machine speech interaction and natural language processing systems. Despite all the recent progress, these systems often struggle with issues caused by speech variabilities. Such variabilities can arise from speaker-dependent characteristics (e.g., shape of the vocal tract, age, gender, and emotional state), environmental noise, channel distortion, speaking rate, and accent [1, 2]. Noise and speech variabilities such as emotion degrade the performance of automatic speech recognition (ASR) systems, which greatly limits their deployment in realistic situations [3, 4].
