Abstract
Handwritten Text Recognition (HTR) accuracy degrades sharply when documents are damaged by smudges, blemishes and blurs, which makes recognition of such documents a challenging task. We therefore propose a system to identify handwritten textual content in documents where state-of-the-art Optical Character Recognition (OCR), even used to its full extent, performs with low accuracy. We introduce word substitution based on character and distance analysis, using a word corpus for spell checking and word completion in the damaged areas. This improves our prediction results, especially in cases where the OCR is prone to predicting false positives, predominantly on the smudged areas. We also run blur detection on every word before segmentation; words flagged as blurred are replaced with suitable substitutes by our algorithm instead of being reported as false positive results. This methodology is more convenient and reliable, since even state-of-the-art HTR technologies do not exceed 71% accuracy. The accuracy of the predicted text is measured using a text similarity metric, the Fuzzy Token Set Ratio (FTSR).
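The word substitution and FTSR scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_distance` threshold and the toy corpus are hypothetical, Levenshtein distance is assumed as the "distance analysis", and the token set ratio follows the commonly used definition (compare the sorted shared tokens against each side's sorted tokens), with Python's `difflib.SequenceMatcher` swapped in as the underlying string ratio.

```python
import re
from difflib import SequenceMatcher
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def substitute(word: str, corpus: Sequence[str], max_distance: int = 2) -> str:
    """Return the closest corpus word, or the input if nothing is close enough.

    max_distance is an assumed threshold, not a value taken from the paper.
    """
    if not corpus or word in corpus:
        return word
    best = min(corpus, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word


def _ratio(a: str, b: str) -> int:
    """String similarity on a 0-100 scale (difflib used as a stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio(s1: str, s2: str) -> int:
    """Order-insensitive similarity based on shared and unshared token sets."""
    t1 = set(re.findall(r"\w+", s1.lower()))
    t2 = set(re.findall(r"\w+", s2.lower()))
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


if __name__ == "__main__":
    corpus = ["document", "damaged", "smudge"]  # toy corpus for illustration
    predicted = " ".join(substitute(w, corpus) for w in "damaged documnt".split())
    print(predicted, token_set_ratio(predicted, "document damaged"))
```

Because the metric compares token sets, word order and repeated words do not lower the score, which suits the evaluation of recognized text where only the recovered vocabulary matters.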
Highlights
Smeared documents are handwritten or printed hard-copy documents that have been exposed to the environment and damaged by foreign objects such as liquids, dust and dirt
Optical Character Recognition (OCR) for handwritten documents remains an open challenge; one way to tackle it is to combine OCR with Natural Language Processing (NLP) for sentence completion until OCR becomes mature enough to identify text across various handwritings, symbols and styles of writing
For Handwritten Text Recognition, we propose a transfer learning approach using an image-based sequence recognition algorithm [12] that runs on six CNN layers for feature extraction and two RNN layers
Summary
Smeared documents are handwritten or printed hard-copy documents that have been exposed to the environment and damaged by foreign objects such as liquids, dust and dirt. These foreign objects either cause smudges and blemishes or obscure certain characters so that they cannot be identified using Optical Character Recognition (OCR). OCR for handwritten documents remains an open challenge; one way to tackle it is to combine OCR with Natural Language Processing (NLP) for sentence completion until OCR becomes mature enough to identify text across various handwritings, symbols and styles of writing. That may take a long time, given the many ways in which humans write different characters. There have been significant improvements in segmentation technologies focusing on historical manuscripts, but these are very specific and the datasets they have been tested on are limited.
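The idea of recovering obscured characters from language knowledge can be sketched at the word level. This is a simplified illustration under stated assumptions, not the paper's method: it assumes the recognizer marks each unreadable character with a placeholder (`?` here), and that a word frequency table (the hypothetical `word_counts`) is available from some corpus. The function name `complete_word` is also hypothetical.

```python
import re
from collections import Counter


def complete_word(masked: str, word_counts: Counter, unknown: str = "?") -> str:
    """Fill in unreadable characters using the most frequent matching corpus word.

    Each `unknown` placeholder matches exactly one character; all other
    characters must match literally. Returns the input unchanged if no
    corpus word fits the pattern.
    """
    pattern = re.compile(
        "".join("." if ch == unknown else re.escape(ch) for ch in masked)
    )
    candidates = [w for w in word_counts if pattern.fullmatch(w)]
    if not candidates:
        return masked
    # Prefer the most frequent candidate; break ties alphabetically (last wins).
    return max(candidates, key=lambda w: (word_counts[w], w))


if __name__ == "__main__":
    # Toy corpus for illustration only.
    text = "the ink on the page was smudged and the page was lost"
    word_counts = Counter(text.split())
    print(complete_word("sm?dg?d", word_counts))
    print(complete_word("t?e", word_counts))
```

A sentence-level model would use the surrounding words to choose between candidates; frequency alone is the simplest stand-in for that context.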
More From: International Journal of Innovative Science and Modern Engineering