Abstract

We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard), we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
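As a concrete picture of the rescoring framework, the following Python sketch (hypothetical code, not the authors' implementation; the `session_lm.score` interface and all names are assumptions) scores each N-best hypothesis of the current utterance with the session-level LM, conditioning on a word history built from the preliminary first-pass results of the preceding utterances, and log-linearly combines that score with the first-pass score:

```python
# Minimal sketch of conversation-level N-best rescoring (hypothetical API).
def rescore_conversation(utterances, session_lm, lm_weight=0.5):
    """utterances: one N-best list per utterance, in conversation order;
    each hypothesis is a (first_pass_score, words) pair, best-first."""
    context = []       # running word-level conversation history
    selected = []
    for nbest in utterances:
        best_words, best_score = None, float("-inf")
        for first_pass_score, words in nbest:
            # Session-LM log-probability of the hypothesis given the
            # conversation so far (assumed scoring interface).
            session_score = session_lm.score(context, words)
            combined = first_pass_score + lm_weight * session_score
            if combined > best_score:
                best_words, best_score = words, combined
        selected.append(best_words)
        # Approximate the history with the first-pass 1-best words, as the
        # abstract describes for the history prior to the current utterance.
        context.extend(nbest[0][1])
    return selected
```

In such a setup the interpolation weight `lm_weight` would be tuned on held-out data; the reranked 1-best could also be fed back into the history instead of the first-pass 1-best.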

Highlights

  • Over the past decade the state of the art in language modeling has shifted from N-gram models to feed-forward networks (Bengio et al., 2006), and to recurrent neural networks (RNNs) that read a list of words sequentially and predict the word at each position

  • Results are reported for the combination of the utterance-scope long short-term memory (LSTM) language model (LM) and the session-level LSTM-LM; perplexity was evaluated on reference transcripts, as is customary

  • We have proposed a simple generalization of utterance-level LSTM language models aimed at capturing conversational phenomena that operate across utterances and speakers, such as lexical entrainment, adjacency pairs, speech overlap, and topical coherence


Summary

Introduction

Over the past decade the state of the art in language modeling has shifted from N-gram models to feed-forward networks (Bengio et al., 2006), and to recurrent neural networks (RNNs) that read a list of words sequentially and predict the word at each position. This potential advantage of an unlimited history is rarely exploited in full, since the language model (LM) is typically “reset” at the start of each utterance in current state-of-the-art recognition systems (Saon et al., 2017; Xiong et al., 2018). Building on the RNN framework, Mikolov and Zweig (2012) proposed augmenting the network inputs with a more slowly varying context vector that encodes longer-range properties of the history, such as a latent semantic indexing vector. The problem with such approaches is that the modeler has to make design decisions about how to encapsulate contextual information as network inputs. Our approach here is to provide the entire conversation history as input to a standard LSTM-LM, and to let the network learn which information is relevant to next-word prediction.
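To make this idea concrete, here is a minimal PyTorch sketch (not the authors' implementation; the hyperparameters and the special-token scheme are assumptions) of a session-level LSTM-LM: the whole conversation is flattened into a single token stream, markers for utterance boundaries and speaker changes are inserted so the network can pick up turn-taking effects, and the LSTM state is carried across utterances instead of being reset.

```python
import torch
import torch.nn as nn

class SessionLSTMLM(nn.Module):
    """LSTM language model whose recurrent state spans an entire conversation."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, time) word ids for a chunk of the conversation.
        # `state` carries the LSTM memory from earlier chunks, so the model
        # is never reset at utterance boundaries.
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state

# Hypothetical input construction: interleave words with boundary and
# speaker-change markers, e.g.
#   <s> hi there </s> <spk_change> <s> hello how are you </s> ...
```

The only change relative to a standard utterance-scope LSTM-LM is how the input stream is built and how the recurrent state is propagated; the network itself is unchanged, which is what lets it learn for itself which conversational context matters for predicting the next word.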
