Abstract

In this paper, we present a noise model for generating synthetic character databases to train Optical Character Recognition (OCR) systems. The steady emergence of new typefaces makes it imperative to generate synthetic training character databases automatically and rapidly. In addition, since the accuracy of OCR systems depends heavily on the number of training samples, a large number of character samples must be generated to retrain them. However, collecting a large set of training samples from real images is time-consuming and laborious. We therefore develop a noise model that automatically generates realistic synthetic character images without the tedious process of acquiring real ones through printing, scanning, copying, and so on. First, our system generates digital character images. Then pepper noise, scale noise, and other kinds of noise are superimposed on the character images. Since the shape of characters may be distorted by real processing steps, geometric transformations are applied to the images to mimic this effect. Measuring OCR accuracy, we observe that training data produced by our noise model is comparable in quality to training data obtained from the real world. Thus, we believe that our noise model is a convenient and appropriate way to generate synthetic databases for training OCR systems.
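
The three stages named above (clean glyph, superimposed noise, geometric distortion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The hard-coded bitmap stands in for a rendered glyph, "scale noise" is read here as a random nearest-neighbour resize, a horizontal shear stands in for the unspecified geometric transformations, and all function names and parameter values are hypothetical.

```python
import random

# Hypothetical 7x5 bitmap standing in for a rendered glyph (1 = ink, 0 = background).
GLYPH_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def add_pepper_noise(img, prob, rng):
    """Turn each background pixel into ink with probability `prob`."""
    return [[1 if (px == 0 and rng.random() < prob) else px for px in row]
            for row in img]


def rescale(img, factor):
    """Nearest-neighbour resize by `factor` (an assumed reading of 'scale noise')."""
    h, w = len(img), len(img[0])
    new_h, new_w = max(1, round(h * factor)), max(1, round(w * factor))
    return [[img[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))]
             for x in range(new_w)]
            for y in range(new_h)]


def shear(img, max_shift):
    """Horizontal shear: shift rows right, from `max_shift` at the top to 0 at the bottom."""
    h = len(img)
    out = []
    for y, row in enumerate(img):
        shift = round(max_shift * (h - 1 - y) / (h - 1)) if h > 1 else 0
        # Pad both sides so every output row has the same width.
        out.append([0] * shift + list(row) + [0] * (max_shift - shift))
    return out


def synthesize(img, rng, pepper_prob=0.05, scale_range=(0.8, 1.5), max_shear=2):
    """Produce one degraded sample; the default parameter values are illustrative only."""
    factor = rng.uniform(*scale_range)
    shift = rng.randint(0, max_shear)
    return add_pepper_noise(shear(rescale(img, factor), shift), pepper_prob, rng)


if __name__ == "__main__":
    sample = synthesize(GLYPH_T, random.Random(0))
    for row in sample:
        print("".join("#" if px else "." for px in row))
```

Calling `synthesize` repeatedly with different seeds yields many distinct variants of the same glyph. Passing the random generator in explicitly keeps each sample reproducible.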
