Prosody modification for speech recognition in emotionally mismatched conditions

Vishnu Vidyadhara Raju Vegesna, Anil Kumar Vuppala, Krishna Gurugubelli

doi:10.1007/s10772-018-9503-z

Abstract

The performance of automatic speech recognition (ASR) systems degrades when training and testing conditions are mismatched. One reason for this degradation is the presence of emotions in speech. The main objective of this work is to improve ASR performance on emotional speech using prosody modification, exploiting the influence of different emotions on the prosody parameters. Emotion conversion methods are employed to generate word-level, non-uniform prosody-modified speech, using modification factors for the prosodic components of pitch, duration and energy. The prosody modification is done in two ways. First, emotion conversion is applied at the testing stage to generate neutral speech from the emotional speech. Second, the ASR system is trained with emotional speech generated from the neutral speech. The effect of emotions in speech is studied for Telugu ASR systems. A new database, the IIIT-H Telugu speech corpus, is collected to build a large-vocabulary neutral-speech Telugu ASR system. Emotional speech samples from the IITKGP-SESC Telugu corpus are used to test it. The emotions of anger, happiness and compassion are considered in the evaluation. An improvement in ASR performance is observed with the prosody-modified speech.
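To make "word-level non-uniform" modification concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each word is summarised by a mean pitch, a duration and a mean energy, and that each word receives its own triple of multiplicative factors. The names `WordProsody`, `Factors` and `modify_prosody`, and every numeric value, are hypothetical; the abstract does not report the actual factors. The sketch only computes target prosody values; imposing them on a waveform would need a signal-processing method that the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordProsody:
    """Per-word prosody summary (hypothetical representation)."""
    word: str
    pitch_hz: float    # mean fundamental frequency over the word
    duration_s: float  # word duration in seconds
    energy: float      # mean energy, arbitrary linear units


@dataclass(frozen=True)
class Factors:
    """Multiplicative modification factors for one word."""
    pitch: float
    duration: float
    energy: float


def modify_prosody(words, factors):
    """Scale each word's prosody by its own factors.

    "Non-uniform" here means the factors may differ from word to word,
    instead of one global factor for the whole utterance.
    """
    if len(words) != len(factors):
        raise ValueError("need exactly one Factors entry per word")
    return [
        WordProsody(
            word=w.word,
            pitch_hz=w.pitch_hz * f.pitch,
            duration_s=w.duration_s * f.duration,
            energy=w.energy * f.energy,
        )
        for w, f in zip(words, factors)
    ]


# Illustrative values only: two words of an "angry" utterance and
# made-up factors that move them toward neutral prosody.
angry = [
    WordProsody("first", pitch_hz=260.0, duration_s=0.30, energy=2.0),
    WordProsody("second", pitch_hz=300.0, duration_s=0.24, energy=3.0),
]
to_neutral = [
    Factors(pitch=0.80, duration=1.10, energy=0.70),
    Factors(pitch=0.70, duration=1.25, energy=0.50),
]
neutralised = modify_prosody(angry, to_neutral)
```

Under these assumptions, the first strategy in the abstract corresponds to applying emotion-to-neutral factors to the test utterances, and the second to applying neutral-to-emotion factors to the training utterances.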

Full Text