Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Kartik Audhkhasi, George Saon, David Nahamoo, Bhuvana Ramabhadran, Michael Picheny

doi:10.21437/interspeech.2017-546

Abstract

Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
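To illustrate what recognizing speech without an LM or decoder can look like, here is a minimal sketch of greedy (best-path) CTC decoding over word units: take the most likely unit in each frame, merge consecutive repeats, and drop blanks. This is a common way to read out a CTC model and is not confirmed by the abstract to be the authors' exact procedure. The function name `ctc_greedy_decode`, the blank index `0`, the toy vocabulary, and the posterior values are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_greedy_decode(frame_posteriors, vocab, blank=BLANK):
    """Best-path CTC readout: argmax per frame, merge repeats, drop blanks."""
    # Most likely output unit for every acoustic frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__)
        for frame in frame_posteriors
    ]
    # Merge runs of the same unit into a single occurrence.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # Remove blanks; what remains is the word sequence.
    return [vocab[idx] for idx in collapsed if idx != blank]


# Toy example with a three-unit output layer (made-up numbers).
vocab = ["<blank>", "hello", "world"]
posteriors = [
    [0.10, 0.80, 0.10],  # hello
    [0.20, 0.70, 0.10],  # hello (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # world
    [0.60, 0.20, 0.20],  # blank
]
print(" ".join(ctc_greedy_decode(posteriors, vocab)))
```

Because the output units are whole words, the collapsed sequence is already the transcript; no pronunciation dictionary or search over an LM is needed in this sketch.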

Similar Papers

Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation
Gakuto Kurata ... Kartik Audhkhasi
15 Sep 2019

Advancing Acoustic-to-Word CTC Model With Attention and Mixed-Units
Amit Das ... Yifan Gong
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 27
04 Sep 2019

Confidence measures for CTC-based phone synchronous decoding
Zhehuai Chen ... Yimeng Zhuang
01 Mar 2017

Advancing Acoustic-to-Word CTC Model
Jinyu Li ... Rui Zhao
01 Apr 2018
