Abstract

Document analysis is of major importance in information retrieval systems. Archives are burdened with vaults of paper and other physical documents, and to protect the important information and summaries they contain without losing their meaning and importance, each document needs to be properly curated and processed. Ancient written documents use many kinds of cursive character sets, which make it very tedious to discriminate the characters and, subsequently, to recover the right meaning. To overcome the difficulty of reading cursive characters and to prevent misunderstanding of the meaning and importance of documents, this work proposes an improved CNN [6] model that works with OCR and the Tesseract API. The documents are scanned, curated, and preprocessed as images. CNNs are among the most effective algorithms currently available in AI and deep learning. A CNN combined with an OCR API could contribute to the development of efficient character recognition strategies, even for complex cursive styles. This article proposes a method that is adaptable to the classification and segmentation of text images with cursive styles. Tesseract is a popular and effective OCR library with a rich API that can enrich the CNN-OCR model.
