Abstract

One of the challenges in digitizing Urdu text is removing noise from document images. In this paper we aim to remove punch-hole and background noise from images of handwritten Urdu text. Techniques developed for English cannot be applied here because Urdu is structurally very different. In addition, Urdu ligatures carry diacritics, and these diacritics change position as characters are added. Removing this noise plays an important role in facilitating the recognition phase of OCR. We found that, because individual Urdu handwriting has such peculiar characteristics, there is no standard for distinguishing the main content from the noise. Different algorithms accomplish different tasks, and some of them require considerable adjustment and optimization, but there is no unified way of removing all kinds of noise. Some noise can be handled by carefully manipulating the grey values of the images; other noise requires more sophisticated algorithms. One major challenge in working with handwritten Urdu text is collecting the data set. The data are so diverse that it becomes hard to recognize patterns among them with traditional algorithms. There is also no standard OCR for handwritten Urdu text; our system was tested manually. To conclude, noise removal in Urdu is a major task with far-reaching implications, which makes this area all the more interesting. With our proposed algorithm we were able to remove 93% of background noise and 96% of punch-hole noise.

