Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes

Ludwig Kürzinger, Robert Baumgartner, Tobias Watzel, Lujun Li, Gerhard Rigoll

doi:10.1007/978-3-030-26061-3_27

Abstract

Hybrid CTC/attention end-to-end speech recognition combines two powerful concepts. Given a speech feature sequence, the attention mechanism directly outputs a sequence of letters. Connectionist Temporal Classification (CTC) helps to bind the attention mechanism to sequential alignments. This hybrid architecture also gives more degrees of freedom in choosing parameter configurations. We applied Gaussian process optimization to estimate the impact of network parameters and of the language model weight used in decoding on Character Error Rate (CER) and on attention accuracy. In total, we trained 70 hybrid CTC/attention networks and performed 590 beam search runs with an RNNLM as language model on the TEDlium v2 test set. To our surprise, the results challenge the assumption that CTC primarily regularizes the attention mechanism. Based on this evidence, we argue that CTC instead regularizes the impact of language model feedback in a one-pass beam search, as letter hypotheses are fed back into the attention mechanism. Attention-only models without RNNLM already achieved \(10.9\%\) CER, or \(22.4\%\) Word Error Rate (WER), on the TEDlium v2 test set. Combined decoding of the same attention-only networks with RNNLM strongly underperformed, with at best \(40.2\%\) CER, or \(49.3\%\) WER. A combined hybrid CTC/attention model with RNNLM performed best, with \(8.9\%\) CER, or \(17.6\%\) WER.
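The abstract does not spell out how the attention, CTC, and RNNLM scores are combined during the one-pass beam search. A minimal sketch, assuming the log-linear interpolation commonly used for hybrid CTC/attention decoding with a language model, is given below. The function names, the parameter names `ctc_weight` and `lm_weight`, and all log-probability values are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def joint_score(log_p_att, log_p_ctc, log_p_lm, ctc_weight=0.5, lm_weight=1.0):
    """Combine per-hypothesis log-probabilities into one beam search score.

    Assumed form (not stated in the abstract):
        (1 - ctc_weight) * log_p_att + ctc_weight * log_p_ctc + lm_weight * log_p_lm
    Setting ctc_weight=0 gives attention-only decoding with a language model.
    """
    return (
        (1.0 - ctc_weight) * log_p_att
        + ctc_weight * log_p_ctc
        + lm_weight * log_p_lm
    )


def best_candidate(candidates, ctc_weight, lm_weight):
    """Return the candidate letter with the highest joint score.

    `candidates` maps a letter to a tuple (log_p_att, log_p_ctc, log_p_lm).
    """
    return max(
        candidates,
        key=lambda letter: joint_score(
            *candidates[letter], ctc_weight=ctc_weight, lm_weight=lm_weight
        ),
    )


# Hypothetical scores: attention and language model both favor "a",
# while CTC considers "a" very unlikely given the alignment.
candidates = {
    "a": (-0.1, -3.0, -0.2),
    "b": (-0.5, -0.4, -0.6),
}

attention_only_choice = best_candidate(candidates, ctc_weight=0.0, lm_weight=1.0)
hybrid_choice = best_candidate(candidates, ctc_weight=0.5, lm_weight=1.0)
```

In this toy setting the CTC term can overrule a candidate that attention and language model agree on, which is one way to picture the regularizing role the abstract attributes to CTC. In a search such as the one described, `ctc_weight` and `lm_weight` would be among the parameters explored by the Gaussian process optimization.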
