Abstract

Publicly available datasets traditionally used to train E2E ASR models for conversational telephone speech recognition consist of clean, short-duration, single-speaker utterances collected on separate channels. While E2E ASR models achieve state-of-the-art performance on recognition tasks that match such training data well, they are observed to fail on test recordings that contain multiple speakers, significant channel or background noise, or longer durations than the training utterances. To mitigate these issues, we propose an on-the-fly data augmentation strategy that transforms single-speaker training data into multi-speaker data by appending multiple single-speaker utterances together. The proposed technique encourages the E2E model to become robust to speaker changes and to process longer utterances effectively. During training, the model is also guided by a teacher model trained on single-speaker utterances to map its multi-speaker encoder embeddings to better-performing single-speaker representations. With the proposed technique we obtain a 7-14% relative improvement on various single-speaker and multi-speaker test sets. We also show that this technique improves recognition performance by up to 14% by capturing useful information from preceding spoken utterances used as dialog history.
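To make the augmentation concrete, the following is a minimal sketch of the concatenation-style strategy the abstract describes: with some probability, several single-speaker utterances are appended on the fly into one longer multi-speaker example, with their transcripts joined accordingly. All names here (Utterance, make_multispeaker_example, embedding_guidance_loss) and the specific probabilities and loss form are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of on-the-fly multi-speaker augmentation by utterance concatenation.
# Names and hyperparameters are hypothetical, not taken from the paper.
import random
from dataclasses import dataclass

import numpy as np


@dataclass
class Utterance:
    """A single-speaker training example: raw waveform plus transcript."""
    samples: np.ndarray  # 1-D waveform, e.g. 16 kHz float32
    transcript: str


def make_multispeaker_example(pool, max_utts=3, p_concat=0.5, rng=random):
    """With probability p_concat, append 2..max_utts randomly chosen
    single-speaker utterances into one longer multi-speaker example;
    otherwise return a single utterance unchanged.

    Concatenating waveforms from different speakers exposes the model to
    speaker changes and to longer durations than the original utterances.
    """
    base = rng.choice(pool)
    if rng.random() >= p_concat:
        return base  # keep some clean single-speaker examples in the mix
    n_extra = rng.randint(1, max_utts - 1)
    extras = [rng.choice(pool) for _ in range(n_extra)]
    samples = np.concatenate([base.samples] + [u.samples for u in extras])
    transcript = " ".join([base.transcript] + [u.transcript for u in extras])
    return Utterance(samples=samples, transcript=transcript)


def embedding_guidance_loss(student_emb, teacher_emb):
    """Hypothetical teacher-guidance term: pull the student's encoder
    embeddings toward those of a teacher trained on single-speaker data
    (mean squared error); the paper's exact loss may differ."""
    return float(np.mean((student_emb - teacher_emb) ** 2))
```

One plausible reading of the teacher guidance is that the mean-squared-error term above is computed between the student's encoder outputs on the concatenated audio and the single-speaker teacher's embeddings on the corresponding original segments; the abstract does not specify the exact formulation, so this is only an assumption.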
