Abstract

We present single-channel approaches to robust automatic speech recognition (ASR) in reverberant environments based on non-intrusive estimation of the clarity index (C50). Our best-performing method includes the estimated value of C50 in the ASR feature vector and also uses C50 to select the most suitable ASR acoustic model according to the reverberation level. We evaluate our method on the REVERB Challenge database with two different C50 estimators and show that it outperforms the best challenge baseline achieved without unsupervised acoustic model adaptation, i.e., the baseline using multi-condition hidden Markov models (HMMs). Relative to that baseline, our approach achieves a 22.4% reduction in word error rate.

Highlights

  • Automatic speech recognition (ASR) is increasingly being used as a tool for a wide range of applications in diverse acoustic conditions

  • Reverberant sound is created in enclosed spaces by reflections from surfaces, which produce multipath sound propagation from the source to the receiver. This effect varies with the acoustic properties of the room and the source-receiver distance, and it is characterized by the room impulse response (RIR)

  • The ASR evaluation tool, provided by the REVERB Challenge, is based on the Hidden Markov Model Toolkit (HTK)


Summary

Introduction

Automatic speech recognition (ASR) is increasingly being used as a tool for a wide range of applications in diverse acoustic conditions (e.g., health care transcription, automatic translation, voicemail-to-text, and voice interfaces for command and control). Distant speech recognition is essential for natural and comfortable human-machine voice interfaces such as those used in the automotive sector and in smartphone applications. Reverberant sound is created in enclosed spaces by reflections from surfaces, which produce multipath sound propagation from the source to the receiver. This effect varies with the acoustic properties of the room and the source-receiver distance, and it is characterized by the room impulse response (RIR).
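To make the link between the RIR and the clarity index concrete, the sketch below computes C50 directly from a known RIR using the standard definition: the ratio, in dB, of the energy arriving within the first 50 ms after the direct path to the energy arriving later. This is only the reference quantity that a non-intrusive estimator tries to predict from reverberant speech alone; it is not the estimator used in the paper. The function names, the synthetic exponentially decaying RIR, and the choice of the largest-magnitude sample as the direct path are all assumptions made for illustration.

```python
import numpy as np


def clarity_index_c50(rir, fs, early_ms=50.0):
    """Return C50 in dB for a room impulse response `rir` sampled at `fs` Hz."""
    rir = np.asarray(rir, dtype=float)
    # Assumption: the direct path is the largest-magnitude sample of the RIR.
    start = int(np.argmax(np.abs(rir)))
    # Index of the first sample that counts as late reverberation.
    split = start + int(round(early_ms * 1e-3 * fs))
    early_energy = np.sum(rir[start:split] ** 2)
    late_energy = np.sum(rir[split:] ** 2)
    if late_energy == 0.0:
        return float("inf")  # no late reverberation at all
    return float(10.0 * np.log10(early_energy / late_energy))


def synthetic_rir(t60, fs=16000, length_s=1.0, seed=0):
    """Toy RIR (hypothetical): direct path plus exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * fs)) / fs
    # Amplitude envelope that falls by 60 dB after t60 seconds.
    envelope = 10.0 ** (-3.0 * t / t60)
    rir = rng.standard_normal(t.size) * envelope
    rir[0] = 10.0  # strong direct path so that argmax locates it
    return rir


if __name__ == "__main__":
    for t60 in (0.3, 1.0):
        c50 = clarity_index_c50(synthetic_rir(t60), 16000)
        print(f"T60 = {t60:.1f} s -> C50 = {c50:.1f} dB")
```

In this toy setting a longer reverberation time yields a lower C50, which is the kind of ordering a system could use when choosing among acoustic models trained for different reverberation levels.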
