Abstract

Handwritten Text Recognition (HTR) accuracy degrades sharply when documents are damaged by smudges, blemishes and blurs, which makes recognition of such documents a challenging task. We therefore propose a system to identify handwritten textual content in documents on which state-of-the-art Optical Character Recognition (OCR), even used to its full extent, performs with low accuracy. By introducing word substitution, which uses character and distance analysis against a word corpus for spell checking and word completion, we improved our prediction results, especially in cases where the OCR is prone to predicting false positives, predominantly on smudged areas. Blur detection is also run on every word before segmentation, and words flagged as blurred are substituted with suitable words instead of being passed on as false positives. This methodology is more convenient and reliable, since even state-of-the-art HTR technologies do not exceed 71% accuracy. The accuracy of the predicted text is measured using a text similarity metric, the Fuzzy Token Set Ratio (FTSR).
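
The two ideas named in the abstract, distance-based word substitution against a corpus and scoring with a token-set similarity, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names (edit_distance, substitute_word, token_set_ratio), the use of Levenshtein distance as the "distance analysis", the max_distance threshold of 2, and the use of Python's difflib as the underlying string ratio. The token_set_ratio function is a simplified approximation of how fuzzy token set ratios are commonly computed, so its scores may differ from those of a dedicated fuzzy-matching library.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def substitute_word(word: str, corpus: list[str], max_distance: int = 2) -> str:
    """Return the closest corpus word, or the input if nothing is close enough.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    if not corpus:
        return word
    best = min(corpus, key=lambda w: edit_distance(word.lower(), w.lower()))
    if edit_distance(word.lower(), best.lower()) <= max_distance:
        return best
    return word


def token_set_ratio(a: str, b: str) -> int:
    """Simplified token-set similarity on a 0-100 scale (approximation)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    s1 = (common + " " + only_a).strip()
    s2 = (common + " " + only_b).strip()
    # Compare the shared tokens with each full string, and the two strings
    # with each other, then keep the best of the three scores.
    pairs = [(common, s1), (common, s2), (s1, s2)]
    return round(100 * max(SequenceMatcher(None, x, y).ratio() for x, y in pairs))


if __name__ == "__main__":
    corpus = ["recognition", "segmentation", "document"]
    print(substitute_word("recogn1tion", corpus))
    print(token_set_ratio("smudged page", "the smudged page was scanned"))
```

Because the score is built from token sets, word order and repeated words do not affect it, which suits comparing a predicted transcription with a reference when some words are missing or reordered.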

Highlights

  • Smeared documents are handwritten or printed hard-copy documents that have been exposed to the environment and damaged by foreign substances such as liquids, dust and dirt

  • Optical Character Recognition (OCR) for handwritten documents is still a growing challenge, and one way to tackle it is to combine OCR with Natural Language Processing (NLP) for sentence completion until OCR is mature enough to identify text across varied handwriting, symbols and writing styles

  • For Handwritten Text Recognition, we propose a transfer learning approach using an image-based sequence recognition algorithm [12] that uses six CNN layers for feature extraction and two RNN layers

Summary

INTRODUCTION

Smeared documents are handwritten or printed hard-copy documents that have been exposed to the environment and damaged by foreign substances such as liquids, dust and dirt. These substances either cause smudges and blemishes or obscure certain characters, which then cannot be identified using Optical Character Recognition (OCR). OCR for handwritten documents is still a growing challenge, and one way to tackle it is to combine OCR with Natural Language Processing (NLP) for sentence completion until OCR is mature enough to identify text across varied handwriting, symbols and writing styles. Reaching that maturity may take a long time, given the many ways in which people write the same characters. There have been significant improvements in segmentation technologies focusing on historical manuscripts, but these are very specific, and the datasets they have been tested on are limited.

EXISTING SYSTEMS
PROPOSED FRAMEWORK
Blur Detection
Handwriting Text Recognition
Word Substitution
THEORETICAL BACKGROUND
AND DISCUSSION
Findings
CONCLUSION