Variable-length category n-gram language models

T.R. Niesler, P.C. Woodland

doi:10.1006/csla.1998.0115

Abstract

This paper presents a language model based on n-grams of word groups (categories). The length of each n-gram is increased selectively according to an estimate of the resulting improvement in predictive quality. This allows the model size to be controlled while including longer-range dependencies when these benefit performance. The categories are chosen to correspond to part-of-speech classifications in a bid to exploit a priori grammatical information. To account for different grammatical functions, the language model allows words to belong to multiple categories, and implicitly involves a statistical tagging operation which may be used to label new text. Intrinsic generalization by the category-based model leads to good performance with sparse data sets. However, word-based n-grams deliver superior average performance as the amount of training material increases. Nevertheless, the category model continues to supply better predictions for word n-tuples not present in the training set. Consequently, a method allowing the two approaches to be combined within a backoff framework is presented. Experiments with the LOB, Switchboard and Wall Street Journal corpora demonstrate that this technique greatly improves language model perplexities for sparse training sets, and offers significantly improved size vs. performance tradeoffs when compared with standard trigram models.
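The backoff combination described in the abstract can be illustrated with a toy sketch. It is not the authors' formulation, which the abstract does not give. It assumes a word bigram model with absolute discounting that backs off to a category bigram model for unseen word pairs. The corpus, the tag set, the discount value, and the names `p_category` and `p_backoff` are all hypothetical. For brevity each word is assigned a single category, whereas the paper allows multiple category membership.

```python
from collections import Counter

# Hypothetical toy data: one category per word (a simplification of the paper,
# which lets a word belong to several categories).
CATEGORY = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
            "sleeps": "VERB", "barks": "VERB"}
CORPUS = [["the", "cat", "sleeps"], ["a", "dog", "barks"], ["the", "dog", "sleeps"]]
DISCOUNT = 0.5  # absolute discount; an arbitrary illustrative value
CATS = sorted(set(CATEGORY.values()))

word_counts, cat_counts = Counter(), Counter()
word_bigrams, history_counts = Counter(), Counter()
cat_bigrams, cat_history = Counter(), Counter()

for sentence in CORPUS:
    for w in sentence:
        word_counts[w] += 1
        cat_counts[CATEGORY[w]] += 1
    for prev, w in zip(sentence, sentence[1:]):
        word_bigrams[prev, w] += 1
        history_counts[prev] += 1
        cat_bigrams[CATEGORY[prev], CATEGORY[w]] += 1
        cat_history[CATEGORY[prev]] += 1


def p_category(word, prev):
    """Category bigram estimate: P(c(word) | c(prev)) * P(word | c(word))."""
    c_prev, c = CATEGORY[prev], CATEGORY[word]
    # Add-one smoothing keeps unseen category transitions above zero.
    p_trans = (cat_bigrams[c_prev, c] + 1) / (cat_history[c_prev] + len(CATS))
    p_member = word_counts[word] / cat_counts[c]
    return p_trans * p_member


def p_backoff(word, prev):
    """Discounted word bigram if the pair was seen, else scaled category estimate."""
    if word_bigrams[prev, word] > 0:
        return (word_bigrams[prev, word] - DISCOUNT) / history_counts[prev]
    seen = [w for w in CATEGORY if word_bigrams[prev, w] > 0]
    kept = sum((word_bigrams[prev, w] - DISCOUNT) / history_counts[prev] for w in seen)
    cat_mass_seen = sum(p_category(w, prev) for w in seen)
    if cat_mass_seen >= 1.0:  # nothing left to back off to
        return 0.0
    # Backoff weight: spread the discounted mass over the unseen words.
    alpha = (1.0 - kept) / (1.0 - cat_mass_seen)
    return alpha * p_category(word, prev)


for prev in CATEGORY:
    total = sum(p_backoff(w, prev) for w in CATEGORY)
    print(f"{prev:>6}: total probability = {total:.6f}")
```

In this sketch, an unseen pair such as ("the", "sleeps") still receives a non-zero probability from the category transition DET to VERB, which mirrors the abstract's point that the category model gives better predictions for word n-tuples absent from the training set.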

Journal: Computer Speech & Language
Publication Date: Jan 1, 1999
Citations: 27
