Abstract
In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing word-level sequences. However, the large abstraction gap between input acoustic signals and output linguistic tokens makes it challenging for a model to learn these representations. In this work, to promote word-level representation learning in end-to-end ASR, we propose a hierarchical conditional model based on connectionist temporal classification (CTC). Our model is trained with auxiliary CTC losses applied to intermediate layers, where the vocabulary size of each target subword sequence is gradually increased as the layer approaches the word-level output. Here, each level of sequence prediction is explicitly conditioned on the sequences predicted at lower levels. With the proposed approach, we expect the model to learn word-level representations effectively by exploiting a hierarchy of linguistic structures. Experimental results on LibriSpeech-{100h, 960h} and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model and other competitive models from prior work. We further analyze the results to confirm the effectiveness of the intended representation learning with our model.
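To make the described training scheme concrete, below is a minimal PyTorch sketch of hierarchical conditional CTC as the abstract outlines it: an encoder stack split into stages, an auxiliary CTC head after each intermediate stage with a vocabulary that grows toward the word level, and conditioning implemented by projecting each stage's posterior back into the encoder stream. The class name HierarchicalConditionalCTC, the plain Transformer blocks, the vocabulary sizes (256, 1024, 8000), and the posterior-feedback conditioning via cond_proj are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalConditionalCTC(nn.Module):
    """Illustrative sketch of hierarchical conditional CTC.

    The encoder is split into stages. After each stage, a CTC head
    predicts a subword sequence at a stage-specific granularity (the
    vocabulary grows toward the word level), and the posterior is
    projected back into the hidden states so later stages are
    explicitly conditioned on lower-level predictions.
    """

    def __init__(self, d_model=256, nhead=4, layers_per_stage=(4, 4, 4),
                 vocab_sizes=(256, 1024, 8000)):  # sizes are assumptions
        super().__init__()
        assert len(layers_per_stage) == len(vocab_sizes)
        block = lambda: nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=1024, batch_first=True)
        self.stages = nn.ModuleList(
            nn.ModuleList(block() for _ in range(n)) for n in layers_per_stage)
        # One CTC head per stage; +1 output for the CTC blank symbol.
        self.ctc_heads = nn.ModuleList(
            nn.Linear(d_model, v + 1) for v in vocab_sizes)
        # Project each stage's posterior back to d_model for conditioning.
        self.cond_proj = nn.ModuleList(
            nn.Linear(v + 1, d_model) for v in vocab_sizes)

    def forward(self, x, x_lens, targets, target_lens):
        """targets/target_lens: one (labels, lengths) pair per stage."""
        total_loss = 0.0
        for i, stage in enumerate(self.stages):
            for layer in stage:
                x = layer(x)
            logits = self.ctc_heads[i](x)          # (B, T, V_i + 1)
            log_probs = logits.log_softmax(-1)
            total_loss = total_loss + F.ctc_loss(
                log_probs.transpose(0, 1), targets[i],
                x_lens, target_lens[i], blank=0, zero_infinity=True)
            if i < len(self.stages) - 1:
                # Feed this stage's prediction into the next stage.
                x = x + self.cond_proj[i](log_probs.exp())
        return total_loss / len(self.stages)

# Toy usage with random data (2 utterances, 100 frames, 20 labels/level).
model = HierarchicalConditionalCTC()
x = torch.randn(2, 100, 256)
x_lens = torch.full((2,), 100, dtype=torch.long)
targets = [torch.randint(1, v + 1, (2, 20)) for v in (256, 1024, 8000)]
target_lens = [torch.full((2,), 20, dtype=torch.long) for _ in range(3)]
loss = model(x, x_lens, targets, target_lens)
loss.backward()
```

One design point the sketch reflects: adding the projected posterior back to the hidden states (rather than concatenating or gating) keeps every stage's input dimension at d_model, so the same block definition can be reused at every level while still making each prediction depend on the levels below it.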