Abstract
¹Present address: Cirrus Logic, Marble Arch House, 66 Seymour St., 1st Floor, London W1H 5BT, United Kingdom.

Automatic speech recognition in everyday environments must be robust to significant levels of reverberation and noise. One strategy to achieve such robustness is multi-microphone speech enhancement. In this study, we present results of an evaluation of different speech enhancement pipelines using a state-of-the-art ASR system across a wide range of reverberation and noise conditions. The evaluation exploits the recently released ACE Challenge database, which includes measured multichannel acoustic impulse responses from 7 different rooms with reverberation times ranging from 0.33 s to 1.34 s. The reverberant speech is mixed with ambient, fan and babble noise recordings made with the same microphone setups in each of the rooms. In the first experiment, the performance of the ASR system without speech processing is evaluated. Results clearly indicate the deleterious effect of both noise and reverberation. In the second experiment, different speech enhancement pipelines are evaluated, with relative word error rate reductions of up to 82%. Finally, the ability of selected instrumental metrics to predict ASR performance improvement is assessed. The best performing metric, the Short-Time Objective Intelligibility measure, is shown to have a Pearson correlation coefficient of 0.79, suggesting that it is a useful predictor of algorithm performance in these tests.
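The reported 0.79 is the sample Pearson correlation coefficient between an instrumental metric's scores and the corresponding ASR improvements. As a minimal sketch of how such a figure is obtained, the snippet below computes Pearson's r from paired observations; the metric scores and relative WER reductions shown are purely illustrative values, not data from the ACE evaluation.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient between paired sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: instrumental metric score per condition vs.
# relative WER reduction achieved by the enhancement pipeline.
metric_scores = [0.45, 0.55, 0.62, 0.70, 0.81]
rel_wer_reduction = [0.10, 0.30, 0.35, 0.55, 0.70]
r = pearson(metric_scores, rel_wer_reduction)
```

A value of r close to 1 indicates that improvements in the instrumental metric track improvements in ASR performance, which is what makes a metric useful as a cheap proxy for running the full recognizer.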
Highlights
Real-world applications of Automatic Speech Recognition (ASR), such as meeting transcription and human-robot interaction, demand that the speaker be some distance from the sound capture device.
It can be seen that, for all noise types, noise is the dominant source of error at low Signal-to-Noise Ratios (SNRs), with Word Error Rate (WER) tending to 100% at −10 dB SNR.
It should be noted that 100% WER is not an upper bound, since there is no limit to the number of insertions which could occur.³

³ http://www.commsp.ee.ic.ac.uk/~sap/resources/csl-ace-asr
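The point above follows from the definition of WER: substitutions, deletions and insertions found by a Levenshtein alignment are summed and divided by the number of reference words, so a hypothesis with many insertions can push the rate past 100%. A minimal sketch (with a made-up reference/hypothesis pair, not an utterance from these tests):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed via a standard Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit operations aligning r[:i] with h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# Insertions are unbounded, so WER can exceed 1.0 (i.e. 100%):
score = wer("turn the light on", "please turn on the red light now then")
print(score)  # → 1.25, i.e. 125% WER on a 4-word reference
```

This is why WER, unlike word accuracy, is not capped at 100% for a noisy recognizer that hallucinates extra words.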
Summary
Real-world applications of Automatic Speech Recognition (ASR), such as meeting transcription and human-robot interaction, demand that the speaker be some distance from the sound capture device. It has been proposed that robustness in distant-talking ASR be achieved through three approaches: enhancement of the audio signal, front-end approaches which enhance the signal in the feature domain, and back-end methods (Haeb-Umbach and Krueger, 2012). Recent challenges such as the REVERB challenge (Kinoshita et al., 2013) and CHiME-3 (Barker et al., 2015) have demonstrated the effectiveness of all three approaches.