Word level correction in Gujarati document using probabilistic approach

Dhruv B Patel, Mukesh M Goswami

doi:10.1109/icgccee.2014.6921395

Abstract

Post-processing is an important part of any document processing system. It can be performed at two levels: word-level correction and sentence-level correction. Word-level correction can itself be done in two ways. The first detects an erroneous word and replaces it with the most similar word in a dictionary; this is the dictionary-based approach. The second selects the most probable word and is known as the probabilistic approach. To build the probabilistic model, which includes unigram, bigram, and trigram statistics, text from various Gujarati newspaper websites is used. The proposed system will use models such as Naive Bayes and the Hidden Markov Model to correct word-level errors. The system will be tested on a synthetic dataset generated by adding random word-level errors to actual documents.
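The probabilistic idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it swaps in a simple bigram score with add-one smoothing and an edit-distance penalty in place of the Naive Bayes and Hidden Markov models the paper proposes, and it only corrects words that are missing from the vocabulary. The function names (`train_ngrams`, `edit_distance`, `correct_word`), the `max_dist` threshold, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_ngrams(sentences):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens)  # "<s>" marks the sentence start
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def edit_distance(a, b):
    """Levenshtein distance over Unicode code points (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct_word(word, prev_word, unigrams, bigrams, max_dist=2):
    """Return the most probable in-vocabulary replacement for `word`.

    Assumption: a word already in the vocabulary is treated as correct,
    so real-word errors are not handled by this sketch.
    """
    vocab = [w for w in unigrams if w != "<s>"]
    if word in vocab:
        return word
    candidates = [w for w in vocab if edit_distance(word, w) <= max_dist]
    if not candidates:
        return word  # nothing close enough; leave the word unchanged

    def score(w):
        # Add-one smoothed P(w | prev_word), in log space, minus an
        # ad hoc penalty of 1 per edit (an illustrative choice).
        p = (bigrams[(prev_word, w)] + 1) / (unigrams[prev_word] + len(vocab))
        return math.log(p) - edit_distance(word, w)

    return max(candidates, key=score)


# Toy corpus (hypothetical); a real model would be trained on newspaper text.
corpus = [["મારું", "નામ"], ["સારું", "કામ"]]
uni, bi = train_ngrams(corpus)
# "ગામ" is one substitution away from both "નામ" and "કામ";
# the preceding word decides between them.
print(correct_word("ગામ", "મારું", uni, bi))  # નામ
print(correct_word("ગામ", "સારું", uni, bi))  # કામ
```

A synthetic test set of the kind the abstract describes could be produced by randomly substituting, inserting, or deleting characters in correct words and checking how often `correct_word` recovers the original.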

Full Text