Abstract

A topic dependent class (TDC) [1] language model (LM) is a topic-based LM that uses a semantic extraction method to reveal latent topic information from relations between nouns. Clustering over a given context is then performed to define topics. Finally, a fixed window of word history is observed to decide the topic of the current event by voting in an online manner. We have previously shown that TDC outperforms several state-of-the-art baselines. This paper introduces two separate contributions. First, we improve TDC further by incorporating a cache-based LM through unigram scaling. The combination is possible because TDC only attempts to capture topical words and does not model re-occurring words, such as function words, very well. Experiments on the Wall Street Journal (WSJ) and Japanese newspaper (Mainichi Shimbun) corpora show that this combination significantly improves the model in terms of perplexity. Second, the stand-alone TDC model suffers from a shrinking training corpus size as the number of topics is increased. We solve this problem by performing soft clustering and soft voting in the training and test phases. Experimental results on the WSJ corpus show that TDC outperforms the baseline without needing to be interpolated with the word-based n-gram.

Index Terms: topic dependent, language model, latent semantic analysis, soft voting, soft clustering, cache
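The abstract does not spell out the combination or voting formulas; the sketch below illustrates one common form of unigram scaling and a simple soft-voting average, under stated assumptions. The function names, the damping exponent `beta`, and the smoothing constant `alpha` are hypothetical and introduced here only for illustration, not taken from the paper.

```python
from collections import Counter

def unigram_scaling(p_base, p_cache, p_unigram, vocab, beta=0.5):
    """Rescale a base LM distribution by a cache-derived unigram ratio.

    p_base:    dict word -> P_base(w | h), e.g. the TDC model's distribution
    p_cache:   dict word -> P_cache(w), estimated from the recent word history
    p_unigram: dict word -> P(w), the static corpus unigram
    beta:      damping exponent (assumed tuning parameter)
    Returns a renormalized distribution over `vocab`.
    """
    eps = 1e-12  # floor to avoid division by zero for unseen words
    scores = {
        w: p_base.get(w, eps) * (p_cache.get(w, eps) / p_unigram.get(w, eps)) ** beta
        for w in vocab
    }
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

def cache_distribution(history, vocab_size, alpha=0.1):
    """Additively smoothed relative frequencies over the recent history,
    used here as a stand-in for the cache-based LM."""
    counts = Counter(history)
    total = len(history) + alpha * vocab_size
    return {w: (counts[w] + alpha) / total for w in counts}

def soft_vote(window_topic_posteriors):
    """Average per-word topic posteriors over a fixed history window to
    obtain a soft topic assignment, rather than a single hard topic."""
    num_topics = len(window_topic_posteriors[0])
    totals = [0.0] * num_topics
    for posterior in window_topic_posteriors:
        for k, p in enumerate(posterior):
            totals[k] += p
    z = sum(totals)
    return [t / z for t in totals]
```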
