Abstract

In this paper, we describe the THUEE (Department of Electronic Engineering, Tsinghua University) team's method of building language models (LMs) for the OpenKWS 2015 Evaluation held by the National Institute of Standards and Technology (NIST). Because NIST provides only a very limited amount of in-domain data, most of our effort goes into making good use of the out-of-domain data. Our work consists of three main steps. First, the out-of-domain data is cleaned. Second, by comparing the cross-entropy difference between the in-domain and out-of-domain data, the portion of the out-of-domain corpus that best matches the in-domain data is selected as training material. Third, the final n-gram LM is obtained by interpolating individual n-gram LMs trained on the different corpora, and all of the training data is further combined to train a single feed-forward neural network LM (FNNLM). In this way, we reduce the perplexity on the development test data by 8.3% for the n-gram LM and 1.7% for the FNNLM, and the Actual Term-Weighted Value (ATWV) of the final result is 0.5391.
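To make the data-selection step concrete, the following is a minimal sketch of cross-entropy difference selection in the spirit of Moore and Lewis (2010). It uses simple add-one-smoothed unigram LMs purely for illustration; the names, threshold, and unigram models here are assumptions for the example, not the authors' actual implementation, which presumably relies on higher-order n-gram LMs.

    import math
    from collections import Counter

    def train_unigram(sentences, smoothing=1.0):
        """Train an add-one-smoothed unigram LM from tokenized sentences."""
        counts = Counter(tok for sent in sentences for tok in sent)
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 to reserve mass for unseen tokens
        return lambda tok: math.log((counts.get(tok, 0) + smoothing)
                                    / (total + smoothing * vocab))

    def cross_entropy(sentence, logprob):
        """Per-token cross-entropy of one sentence under a unigram LM."""
        return -sum(logprob(tok) for tok in sentence) / max(len(sentence), 1)

    def select(out_domain, in_lm, out_lm, threshold=0.0):
        """Keep out-of-domain sentences whose cross-entropy difference
        H_in(s) - H_out(s) is below the threshold, i.e. sentences that
        look more like the in-domain data than the out-of-domain data."""
        kept = []
        for sent in out_domain:
            score = cross_entropy(sent, in_lm) - cross_entropy(sent, out_lm)
            if score < threshold:
                kept.append(sent)
        return kept

    # Hypothetical usage: in_sents / out_sents are lists of tokenized sentences.
    # in_lm = train_unigram(in_sents)
    # out_lm = train_unigram(out_sents)   # often trained on a same-size random sample
    # training_corpus = in_sents + select(out_sents, in_lm, out_lm)

The threshold controls how much out-of-domain text is admitted; lower values keep only sentences that the in-domain model scores markedly better than the out-of-domain model.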
