Abstract

Chinese spell checking is an important research topic, needed in applications such as optical character recognition, speech recognition, and search engines. Because many Chinese characters are similar in shape or pronunciation, automatically detecting and correcting Chinese spelling errors remains an open problem. In this paper, we propose a hybrid approach to detecting and correcting a common class of Chinese word errors, called auxiliary word errors. First, to address the lack of datasets containing Chinese auxiliary errors, we generate an artificial dataset of auxiliary errors from an auxiliary confusion set and a large Web corpus. Second, we propose a neural network detection model that adopts BERT as the embedding layer and combines BiLSTM with CRF. Third, we use the auxiliary confusion set and a recurrent neural network language model (RNNLM) to correct auxiliary errors in text. Experimental results on different test datasets show that our hybrid approach achieves better performance than traditional baseline methods.
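As an illustration of the first step, the sketch below shows one way artificial auxiliary errors could be injected into clean sentences using a confusion set. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the abstract does not list the members of the confusion set, so the three auxiliaries 的, 地, and 得 are assumed here, and the function name `inject_auxiliary_errors`, the `error_rate` parameter, and the per-character 0/1 labels are all hypothetical.

```python
import random

# Assumed confusion set: the abstract does not list its members.
# Each auxiliary maps to the auxiliaries it may be confused with.
AUX_CONFUSION = {
    "的": ["地", "得"],
    "地": ["的", "得"],
    "得": ["的", "地"],
}


def inject_auxiliary_errors(sentence, confusion=AUX_CONFUSION,
                            error_rate=0.3, rng=None):
    """Return (noisy_sentence, labels) for one clean sentence.

    Each auxiliary found in the confusion set is replaced, with
    probability `error_rate`, by a randomly chosen confusable one.
    labels[i] is 1 where character i was replaced, else 0.
    """
    if rng is None:
        rng = random.Random()
    chars, labels = [], []
    for ch in sentence:
        if ch in confusion and rng.random() < error_rate:
            chars.append(rng.choice(confusion[ch]))
            labels.append(1)
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


if __name__ == "__main__":
    noisy, labels = inject_auxiliary_errors(
        "他高兴地笑了", error_rate=1.0, rng=random.Random(0)
    )
    print(noisy, labels)
```

The character-level labels produced this way could serve as targets for a sequence-labeling detector, and the same confusion set could supply the replacement candidates that a language model ranks at correction time.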
