Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data

Takashi Fukuda, Samuel Thomas

doi:10.1109/icassp49357.2023.10095218

Abstract

This paper proposes a novel modeling framework for effective training of end-to-end automatic speech recognition (ASR) models on data from diverse domains and sources: speech paired with clean ground-truth transcripts, speech with noisy pseudo transcripts from semi-supervised decodes, and unpaired text-only data. In our proposed approach, we build a recurrent neural network transducer (RNN-T) model with a shared multimodal encoder, multi-branch prediction networks, and a shared joint network. To train on unpaired text-only data sets along with transcribed speech data, the shared encoder is trained to process both speech and text modalities. Differences in data from multiple domains are handled by training a multi-branch prediction network on the different data sets; an interpolation step then combines the branches back into a single, computationally efficient branch. We show the benefit of our proposed technique on several ASR test sets by comparing our models to those trained by simple data mixing. The technique provides a significant relative improvement of up to 6% over baseline systems operating at a similar decoding cost.
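The interpolation step that merges the prediction-network branches can be pictured with a small sketch. The abstract does not say how the interpolation is performed, so the code below assumes the simplest reading: a parameter-wise weighted average over branches that share the same architecture. The function name `interpolate_branches`, the branch names, the parameter names, and the interpolation weights are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def interpolate_branches(branches: Sequence[Params],
                         weights: Sequence[float]) -> Params:
    """Merge several prediction-network branches into one branch.

    Assumption (not stated in the abstract): interpolation is a
    parameter-wise weighted average, and every branch has identical
    parameter names and sizes.
    """
    if not branches or len(branches) != len(weights):
        raise ValueError("need one interpolation weight per branch")
    total = sum(weights)
    if total <= 0:
        raise ValueError("interpolation weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    merged: Params = {}
    for name, reference in branches[0].items():
        # Every branch must expose the same parameter with the same size.
        for branch in branches:
            if name not in branch or len(branch[name]) != len(reference):
                raise ValueError(f"branch parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * branch[name][i] for w, branch in zip(norm, branches))
            for i in range(len(reference))
        ]
    return merged


# Toy values standing in for three branches, one per data source
# (clean transcripts, pseudo transcripts, text-only); all numbers are made up.
clean = {"embed": [1.0, 2.0], "lstm": [0.0, 4.0]}
pseudo = {"embed": [3.0, 2.0], "lstm": [2.0, 0.0]}
text_only = {"embed": [5.0, 2.0], "lstm": [4.0, 2.0]}

single_branch = interpolate_branches([clean, pseudo, text_only],
                                     [0.5, 0.25, 0.25])
print(single_branch)
```

Under this assumption the merged model has exactly one prediction network, so decoding cost matches that of a single-branch RNN-T, which is consistent with the abstract's claim of a similar decoding cost to the baseline.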

