Abstract

A topic dependent class (TDC) [1] language model (LM) is a topic-based LM that uses a semantic extraction method to reveal latent topic information from relations between nouns. Clustering over a given context is then performed to define topics. Finally, a fixed window of word history is observed to decide the topic of the current event by voting in an online manner. We have previously shown that TDC outperforms several state-of-the-art baselines. This paper introduces two separate contributions. First, we improve TDC further by incorporating a cache-based LM through unigram scaling. The combination is possible because TDC only attempts to capture topical words and does not model re-occurring words, such as function words, very well. Experiments on the Wall Street Journal (WSJ) and Japanese newspaper (Mainichi Shimbun) corpora show that this combination significantly improves the model in terms of perplexity. Second, the stand-alone TDC model suffers from a shrinking training corpus size as the number of topics is increased. We solve this problem by performing soft clustering and soft voting in the training and test phases. Experimental results on the WSJ corpus show that TDC outperforms the baseline without needing to be interpolated with the word-based n-gram.

Index Terms: topic dependent, language model, latent semantic analysis, soft voting, soft clustering, cache
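The abstract does not spell out the combination or voting formulas; the sketch below illustrates one common form of unigram scaling and a simple soft-voting average, under stated assumptions. The function names, the damping exponent `beta`, and the smoothing constant `alpha` are hypothetical and introduced here only for illustration, not taken from the paper.

```python
from collections import Counter

def unigram_scaling(p_base, p_cache, p_unigram, vocab, beta=0.5):
    """Rescale a base LM distribution by a cache-derived unigram ratio.

    p_base:    dict word -> P_base(w | h), e.g. the TDC model's distribution
    p_cache:   dict word -> P_cache(w), estimated from the recent word history
    p_unigram: dict word -> P(w), the static corpus unigram
    beta:      damping exponent (assumed tuning parameter)
    Returns a renormalized distribution over `vocab`.
    """
    eps = 1e-12  # floor to avoid division by zero for unseen words
    scores = {
        w: p_base.get(w, eps) * (p_cache.get(w, eps) / p_unigram.get(w, eps)) ** beta
        for w in vocab
    }
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

def cache_distribution(history, vocab_size, alpha=0.1):
    """Additively smoothed relative frequencies over the recent history,
    used here as a stand-in for the cache-based LM."""
    counts = Counter(history)
    total = len(history) + alpha * vocab_size
    return {w: (counts[w] + alpha) / total for w in counts}

def soft_vote(window_topic_posteriors):
    """Average per-word topic posteriors over a fixed history window to
    obtain a soft topic assignment, rather than a single hard topic."""
    num_topics = len(window_topic_posteriors[0])
    totals = [0.0] * num_topics
    for posterior in window_topic_posteriors:
        for k, p in enumerate(posterior):
            totals[k] += p
    z = sum(totals)
    return [t / z for t in totals]
```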
