Abstract

Chinese spelling check (CSC) is still an unsolved problem today since there are many homonymous or homomorphous characters. Recently, more and more CSC systems have been proposed. To the best of our knowledge, language modeling is one of the major components among these systems because of its simplicity and moderately good predictive power. After deeply analyzing the school of research, we are aware that most of the systems only employ the conventional n -gram language models. The contributions of this article are threefold. First, we propose a novel probabilistic framework for CSC, which naturally combines several important components, such as the substitution model and the language model, to inherit their individual merits as well as to overcome their limitations. Second, we incorporate the topic language models into the CSC system in an unsupervised fashion. The topic language models can capture the long-span semantic information from a word (character) string while the conventional n -gram language models can only preserve the local regularity information. Third, we further integrate Web resources with the proposed framework to enhance the overall performance. Our rigorously empirical experiments demonstrate the consistent and utility performance of the proposed framework in the CSC task.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call