Abstract

Although Long Short-Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how offline systems can best be adapted to the streaming setup. After gaining considerable experience in this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal performance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques. Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows on a main Spanish broadcasting station.

Highlights

  • Live video streaming services over the Internet have increased dramatically in recent years because of higher user demand and bandwidth speeds

  • A Language Model History Recombination (LMHR) parameter is needed to control the length of histories: in History Conditioned Search (HCS) decoders, hypotheses are grouped according to their history, so without enforcing any back-off recombination the histories of active hypotheses tend to grow without limit (see the sketch after this list)

  • In this work, an improved decoder based on the conventional hybrid Automatic Speech Recognition (ASR) approach was proposed by adapting state-of-the-art models to the streaming setup
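
As a minimal illustration of the recombination idea in the second highlight, the following Python sketch merges hypotheses whose word histories agree on their last few words, keeping only the best-scoring one per truncated history. The `Hypothesis` class, the `recombine` helper, and the `lmhr` parameter name are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of back-off history recombination in a history-conditioned
# search (HCS) decoder. Names (Hypothesis, recombine, lmhr) are hypothetical.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    history: tuple   # word history conditioning the LM, oldest word first
    score: float     # accumulated log-probability (higher is better)

def recombine(hyps, lmhr):
    """Merge hypotheses whose histories agree on the last `lmhr` words,
    keeping only the best-scoring hypothesis per truncated history."""
    best = {}
    for hyp in hyps:
        key = hyp.history[-lmhr:] if lmhr > 0 else ()
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())

# Without recombination, the number of distinct histories (and thus of
# active hypotheses) grows with the utterance length; truncating histories
# to lmhr words bounds it.
hyps = [
    Hypothesis(("the", "cat", "sat"), -4.2),
    Hypothesis(("a", "cat", "sat"), -4.5),   # same 2-word suffix as above
    Hypothesis(("the", "dog", "ran"), -5.0),
]
print(recombine(hyps, lmhr=2))  # the two "cat sat" hypotheses are merged
```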


Summary

INTRODUCTION

Live video streaming services over the Internet have increased dramatically in recent years because of higher user demand and bandwidth speeds. In contrast to using a sliding window over the incoming signal, a different approach consists in splitting it into overlapping chunks with appended (past and future) contextual observations. This approach was followed in [3], where the so-called Context-Sensitive-Chunk (CSC) method was proposed to speed up BLSTM training for low-latency decoding by just adding some delay between consecutive chunks. Other relevant contributions addressing one-pass decoding with neural LMs have focused on heuristics to reduce the number of queries to the model and on caching network states [20], alternative one-pass decoding strategies such as on-the-fly rescoring [21], improved CPU-GPU communication [22] and, more recently, combining Gated Recurrent Units with more efficient objective functions such as Noise Contrastive Estimation [23].
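
As a rough illustration of the chunk-splitting idea behind CSC, the following Python sketch cuts a feature stream into fixed-size chunks padded with past and future contextual frames; only the central frames of each padded window would produce decoder output. The function name `csc_chunks` and its parameters are hypothetical, not taken from [3].

```python
# Minimal sketch of context-sensitive-chunk (CSC) splitting: the incoming
# feature stream is cut into fixed-size chunks, each padded with past and
# future context frames so a BLSTM can be run per chunk. Parameter names
# (chunk_size, n_past, n_future) are illustrative assumptions.

import numpy as np

def csc_chunks(features, chunk_size, n_past, n_future):
    """Yield (window, chunk_offset, chunk_len) triples over a (T, D)
    feature matrix. Only the central `chunk_len` frames of each padded
    window produce output; the appended context frames condition the
    BLSTM and are then discarded."""
    T = features.shape[0]
    for start in range(0, T, chunk_size):
        lo = max(0, start - n_past)
        hi = min(T, start + chunk_size + n_future)
        window = features[lo:hi]
        # offset and length of the central chunk within the padded window
        yield window, start - lo, min(chunk_size, T - start)

# Latency is bounded by the future context: decoding of a chunk can begin
# as soon as its n_future look-ahead frames have been received.
feats = np.random.randn(1000, 40)            # e.g. 10 s of 40-dim features
for window, offset, n_out in csc_chunks(feats, 50, 20, 10):
    pass  # run the BLSTM on `window`, keep outputs [offset:offset + n_out]
```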

DEEP BIDIRECTIONAL LSTM ACOUSTIC MODELS FOR STREAMING
Streaming decoding using BLSTM acoustic models
Acoustic Model Look-ahead
Acoustic Feature Normalization for Streaming
EFFICIENT ONE-PASS DECODING USING INTERPOLATED NEURAL LMS
Language Model Look-ahead
Neural LM integration
LM pruning parameters
Evaluation Datasets
Training setup
Experiments on acoustic modeling for streaming
Experiments on language modeling for streaming
Findings
CONCLUSION AND FUTURE WORK