Abstract

In this paper, we compare the efficacy of a variety of language models (LMs) for rescoring word graphs and N-best lists generated by a large vocabulary continuous speech recognizer. These LMs differ in the level of knowledge used (word, lexical features, syntax) and in how that knowledge is integrated (tightly or loosely). The trigram LM incorporates word-level information; our Part-of-Speech (POS) LM uses word and lexical class information in a tightly coupled way; our new SuperARV LM tightly integrates word information, a richer set of lexical features than POS, and syntactic dependency information; and the Parser LM integrates some limited word information, POS, and syntactic information. We also investigate LMs created by linear interpolation of LM pairs. When the LMs are compared on the task of rescoring word graphs or N-best lists for the Wall Street Journal (WSJ) 5k- and 20k-vocabulary test sets, the SuperARV LM always achieves the greatest reduction in word error rate (WER) and the greatest increase in sentence accuracy (SAC). On the 5k test sets, the SuperARV LM obtains more than a 10% relative reduction in WER compared to the trigram LM, and on the 20k test set, more than 2%. Additionally, the SuperARV LM performs comparably to or better than the interpolated LMs. Hence, we conclude that the tight coupling of knowledge from all three levels is an effective method of constructing high-quality LMs.
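As a rough illustration of two ideas the abstract refers to, the sketch below shows linear interpolation of a pair of LMs applied to N-best rescoring, and the arithmetic behind a relative WER reduction. It is not the paper's implementation: the function names, the interpolation weight `lam`, the LM scale `lm_weight`, and the toy probabilities are all assumptions made for this example.

```python
import math

def interpolated_logprob(word_probs, lam):
    """Log probability of a hypothesis under a linearly interpolated LM pair.

    word_probs: one (p_a, p_b) pair per word, giving the probability each of
    the two LMs assigns to that word in context (hypothetical inputs).
    lam: interpolation weight on the first LM, between 0 and 1.
    """
    return sum(math.log(lam * p_a + (1.0 - lam) * p_b) for p_a, p_b in word_probs)

def rescore_nbest(nbest, lam=0.5, lm_weight=1.0):
    """Return the text of the best hypothesis after LM rescoring.

    nbest: list of (text, acoustic_logprob, word_probs) tuples.
    The combined score is acoustic_logprob + lm_weight * LM log probability.
    """
    best = max(
        nbest,
        key=lambda hyp: hyp[1] + lm_weight * interpolated_logprob(hyp[2], lam),
    )
    return best[0]

def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in WER, e.g. 0.10 for a 10% relative reduction."""
    return (baseline_wer - new_wer) / baseline_wer

# Toy 2-best list: the second hypothesis has a slightly worse acoustic score
# but is preferred by both LMs, so it wins after rescoring.
toy_nbest = [
    ("a b", -10.0, [(0.1, 0.2), (0.1, 0.2)]),
    ("a c", -10.5, [(0.5, 0.4), (0.5, 0.4)]),
]
chosen = rescore_nbest(toy_nbest, lam=0.5)
```

In this toy list, rescoring selects "a c" even though "a b" has the better acoustic score, which is how a stronger LM can lower WER without changing the recognizer itself.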
