Abstract

We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard), we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
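As a concrete picture of the rescoring framework, the following Python sketch (hypothetical code, not the authors' implementation; the `session_lm.score` interface and all names are assumptions) scores each N-best hypothesis of the current utterance with the session-level LM, conditioning on a word history built from the preliminary first-pass results of the preceding utterances, and log-linearly combines that score with the first-pass score:

```python
# Minimal sketch of conversation-level N-best rescoring (hypothetical API).
def rescore_conversation(utterances, session_lm, lm_weight=0.5):
    """utterances: one N-best list per utterance, in conversation order;
    each hypothesis is a (first_pass_score, words) pair, best-first."""
    context = []       # running word-level conversation history
    selected = []
    for nbest in utterances:
        best_words, best_score = None, float("-inf")
        for first_pass_score, words in nbest:
            # Session-LM log-probability of the hypothesis given the
            # conversation so far (assumed scoring interface).
            session_score = session_lm.score(context, words)
            combined = first_pass_score + lm_weight * session_score
            if combined > best_score:
                best_words, best_score = words, combined
        selected.append(best_words)
        # Approximate the history with the first-pass 1-best words, as the
        # abstract describes for the history prior to the current utterance.
        context.extend(nbest[0][1])
    return selected
```

In such a setup the interpolation weight `lm_weight` would be tuned on held-out data; the reranked 1-best could also be fed back into the history instead of the first-pass 1-best.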

Highlights

  • Over the past decade the state of the art in language modeling has shifted from N-gram models to feed-forward networks (Bengio et al., 2006), and to recurrent neural networks (RNNs) that read a list of words sequentially and predict the word at each position

  • Results are reported for the combination of the utterance-scope long short-term memory (LSTM) language model (LM) and the session-level LSTM-LM; perplexity was evaluated on reference transcripts, as is customary

  • We have proposed a simple generalization of utterance-level LSTM language models aimed at capturing conversational phenomena that operate across utterances and speakers, such as lexical entrainment, adjacency pairs, speech overlap, and topical coherence


Summary

Introduction

Over the past decade the state of the art in language modeling has shifted from N-gram models to feed-forward networks (Bengio et al., 2006), and to recurrent neural networks (RNNs) that read a list of words sequentially and predict the word at each position. This potential advantage of an unlimited history is rarely exploited in full, since the language model (LM) is typically “reset” at the start of each utterance in current state-of-the-art recognition systems (Saon et al., 2017; Xiong et al., 2018). Building on the RNN framework, Mikolov and Zweig (2012) proposed augmenting the network inputs with a more slowly varying context vector that encodes longer-range properties of the history, such as a latent semantic indexing vector. The problem with such approaches is that the modeler has to make design decisions about how to encapsulate contextual information as network inputs. Our approach here is to provide the entire conversation history as input to a standard LSTM-LM, and to let the network learn which information is relevant to next-word prediction.
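To make this idea concrete, here is a minimal PyTorch sketch (not the authors' implementation; the hyperparameters and the special-token scheme are assumptions) of a session-level LSTM-LM: the whole conversation is flattened into a single token stream, markers for utterance boundaries and speaker changes are inserted so the network can pick up turn-taking effects, and the LSTM state is carried across utterances instead of being reset.

```python
import torch
import torch.nn as nn

class SessionLSTMLM(nn.Module):
    """LSTM language model whose recurrent state spans an entire conversation."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, time) word ids for a chunk of the conversation.
        # `state` carries the LSTM memory from earlier chunks, so the model
        # is never reset at utterance boundaries.
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state

# Hypothetical input construction: interleave words with boundary and
# speaker-change markers, e.g.
#   <s> hi there </s> <spk_change> <s> hello how are you </s> ...
```

The only change relative to a standard utterance-scope LSTM-LM is how the input stream is built and how the recurrent state is propagated; the network itself is unchanged, which is what lets it learn for itself which conversational context matters for predicting the next word.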
