Abstract

A topic-dependent-class (TDC)-based n-gram language model (LM) is a topic-based LM that employs a semantic extraction method to reveal latent topic information extracted from noun-noun relations. The topic of a given word sequence is decided, through voting, on the basis of the most frequently occurring (weighted) noun classes in the context history. Our previous work (W. Naptali, M. Tsuchiya, and S. Nakagawa, "Topic-dependent language model with voting on noun history," ACM Trans. Asian Language Information Processing (TALIP), vol. 9, no. 2, pp. 1-31, 2010) has shown that, in terms of perplexity, TDCs outperform several state-of-the-art baselines, i.e., a word-based or class-based n-gram LM and their interpolation, a cache-based LM, an n-gram-based topic-dependent LM, and a Latent Dirichlet Allocation (LDA)-based topic-dependent LM. This study is a follow-up to our previous work, with three key differences. First, we improve TDCs by employing soft-clustering and/or soft-voting techniques, which solve the data-shrinking problem and make TDCs independent of the word-based n-gram in the training and/or test phases. Second, for further improvement, we incorporate a cache-based LM through unigram scaling, because the TDC and cache-based LM capture different properties of the language. Finally, we provide an evaluation in terms of the word error rate (WER) and an analysis of the automatic speech recognition (ASR) rescoring task. Experiments performed on the Wall Street Journal and the Mainichi Shimbun (a Japanese newspaper) demonstrate that the TDC LM improves both perplexity and the WER. The perplexity reduction is up to 25.1% relative on the English corpus and 25.7% relative on the Japanese corpus. Furthermore, the greatest reduction in the WER is 15.2% relative for the English ASR and 24.3% relative for the Japanese ASR, as compared to the baseline.
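
The abstract describes topic selection as weighted voting over noun classes observed in the context history. The following minimal Python sketch illustrates that idea only; the function and variable names, the fallback behavior, and the example classes and weights are illustrative assumptions, not the authors' implementation.

    from collections import defaultdict

    def select_topic(history_nouns, noun_to_class_weights):
        """Hypothetical sketch: each noun in the context history casts
        (weighted) votes for the noun classes it belongs to; the class with
        the most votes is taken as the topic-dependent class.

        history_nouns         -- nouns extracted from the context history
        noun_to_class_weights -- mapping noun -> {class_id: weight}; with hard
                                 clustering each noun maps to one class with
                                 weight 1.0, with soft clustering the weights
                                 form a distribution over classes.
        """
        votes = defaultdict(float)
        for noun in history_nouns:
            for class_id, weight in noun_to_class_weights.get(noun, {}).items():
                votes[class_id] += weight
        if not votes:
            return None  # no nouns in the history: fall back to the general n-gram LM
        return max(votes, key=votes.get)

    # Example with made-up classes and weights:
    # history = ["market", "stock", "investor"]
    # weights = {"market":   {"finance": 0.8, "retail": 0.2},
    #            "stock":    {"finance": 1.0},
    #            "investor": {"finance": 0.9, "law": 0.1}}
    # select_topic(history, weights)  ->  "finance"

Under the soft-voting variant mentioned in the abstract, the vote distribution itself could presumably weight a mixture of topic-dependent models rather than committing to the single argmax shown here; the sketch covers only the hard-decision case.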
