Abstract
Language modeling with n-grams is popular for speech recognition and many other applications. However, the conventional n-gram suffers from insufficient training data and an inability to capture domain knowledge and long-distance language dependencies. This paper presents a new approach to mining long-distance word associations and incorporating their mutual information into language models. We aim to discover associations of multiple distant words from the training corpus. An efficient algorithm is developed to merge the frequent word subsets and construct the association patterns. The resulting association pattern n-gram is general, with the trigger-pair n-gram as a special realization in which only associations of two distant words are considered. To improve the modeling, we further compensate for sparse training data via parameter smoothing and for domain mismatch via online adaptive learning. The proposed association pattern n-gram and several hybrid models are successfully applied to speech recognition. We also find that incorporating the mutual information of association patterns significantly reduces the perplexities of language models.
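To make the trigger-pair idea concrete, the following is a minimal sketch (not the paper's algorithm) of how pointwise mutual information can be estimated for distant word pairs co-occurring within a history window; the function name `trigger_pair_pmi` and the window-based counting scheme are illustrative assumptions.

```python
import math
from collections import Counter

def trigger_pair_pmi(corpus, window=5):
    """Estimate pointwise mutual information for distant word pairs
    (trigger pairs) that co-occur within `window` positions.
    `corpus` is a list of tokenized sentences. Illustrative sketch only."""
    unigrams = Counter()
    pairs = Counter()
    n_tokens = 0
    n_pairs = 0
    for sent in corpus:
        unigrams.update(sent)
        n_tokens += len(sent)
        for i, w in enumerate(sent):
            # pair each word with the later words inside its window
            for v in sent[i + 1 : i + 1 + window]:
                pairs[(w, v)] += 1
                n_pairs += 1
    pmi = {}
    for (w, v), c in pairs.items():
        p_wv = c / n_pairs
        p_w = unigrams[w] / n_tokens
        p_v = unigrams[v] / n_tokens
        # PMI(w, v) = log P(w, v) / (P(w) P(v))
        pmi[(w, v)] = math.log(p_wv / (p_w * p_v))
    return pmi
```

Pairs with high PMI would then be the candidate triggers whose mutual information is folded into the language model; extending the counts from word pairs to frequent word subsets yields the more general association patterns described above.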