Abstract

Although Long Short-Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how offline systems can best be adapted to the streaming setup. After gaining considerable experience in this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal performance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques. Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows on a main Spanish broadcasting station.

Highlights

  • Live video streaming services over the Internet have increased dramatically in recent years because of higher user demand and bandwidth speeds

  • A Language Model History Recombination (LMHR) parameter is needed to control the length of histories: in History Conditioned Search (HCS) decoders, hypotheses are grouped according to their history, so without enforcing any back-off recombination the histories of active hypotheses tend to grow without limit (see the sketch after this list)

  • In this work, an improved decoder based on the conventional hybrid Automatic Speech Recognition (ASR) approach was proposed by adapting state-of-the-art models to the streaming setup
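
As a minimal illustration of the recombination idea in the second highlight, the following Python sketch merges hypotheses whose word histories agree on their last few words, keeping only the best-scoring one per truncated history. The `Hypothesis` class, the `recombine` helper, and the `lmhr` parameter name are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of back-off history recombination in a history-conditioned
# search (HCS) decoder. Names (Hypothesis, recombine, lmhr) are hypothetical.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    history: tuple   # word history conditioning the LM, oldest word first
    score: float     # accumulated log-probability (higher is better)

def recombine(hyps, lmhr):
    """Merge hypotheses whose histories agree on the last `lmhr` words,
    keeping only the best-scoring hypothesis per truncated history."""
    best = {}
    for hyp in hyps:
        key = hyp.history[-lmhr:] if lmhr > 0 else ()
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())

# Without recombination, the number of distinct histories (and thus of
# active hypotheses) grows with the utterance length; truncating histories
# to lmhr words bounds it.
hyps = [
    Hypothesis(("the", "cat", "sat"), -4.2),
    Hypothesis(("a", "cat", "sat"), -4.5),   # same 2-word suffix as above
    Hypothesis(("the", "dog", "ran"), -5.0),
]
print(recombine(hyps, lmhr=2))  # the two "cat sat" hypotheses are merged
```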


Summary

INTRODUCTION

Live video streaming services over the Internet have increased dramatically in recent years because of higher user demand and bandwidth speeds. In contrast to using a sliding window over the incoming signal, a different approach consists in splitting it into overlapping chunks with appended (past and future) contextual observations. This approach was followed in [3], where the so-called Context-Sensitive-Chunk (CSC) method was proposed to speed up BLSTM training for low-latency decoding by just adding some delay between consecutive chunks. Other relevant contributions addressing one-pass decoding with neural LMs have focused on heuristics to reduce the number of queries to the model and on caching network states [20], alternative one-pass decoding strategies such as on-the-fly rescoring [21], improved CPU-GPU communication [22] and, more recently, combining Gated Recurrent Units with more efficient objective functions such as Noise Contrastive Estimation [23].
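
As a rough illustration of the chunk-splitting idea behind CSC, the following Python sketch cuts a feature stream into fixed-size chunks padded with past and future contextual frames; only the central frames of each padded window would produce decoder output. The function name `csc_chunks` and its parameters are hypothetical, not taken from [3].

```python
# Minimal sketch of context-sensitive-chunk (CSC) splitting: the incoming
# feature stream is cut into fixed-size chunks, each padded with past and
# future context frames so a BLSTM can be run per chunk. Parameter names
# (chunk_size, n_past, n_future) are illustrative assumptions.

import numpy as np

def csc_chunks(features, chunk_size, n_past, n_future):
    """Yield (window, chunk_offset, chunk_len) triples over a (T, D)
    feature matrix. Only the central `chunk_len` frames of each padded
    window produce output; the appended context frames condition the
    BLSTM and are then discarded."""
    T = features.shape[0]
    for start in range(0, T, chunk_size):
        lo = max(0, start - n_past)
        hi = min(T, start + chunk_size + n_future)
        window = features[lo:hi]
        # offset and length of the central chunk within the padded window
        yield window, start - lo, min(chunk_size, T - start)

# Latency is bounded by the future context: decoding of a chunk can begin
# as soon as its n_future look-ahead frames have been received.
feats = np.random.randn(1000, 40)            # e.g. 10 s of 40-dim features
for window, offset, n_out in csc_chunks(feats, 50, 20, 10):
    pass  # run the BLSTM on `window`, keep outputs [offset:offset + n_out]
```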

DEEP BIDIRECTIONAL LSTM ACOUSTIC MODELS FOR STREAMING
Streaming decoding using BLSTM acoustic models
Acoustic Model Look-ahead
Acoustic Feature Normalization for Streaming
EFFICIENT ONE-PASS DECODING USING INTERPOLATED NEURAL LMS
Language Model Look-ahead
Neural LM integration
LM pruning parameters
Evaluation Datasets
Training setup
Experiments on acoustic modeling for streaming
Experiments on language modeling for streaming
Findings
CONCLUSION AND FUTURE WORK