Abstract

Digitization of documents and books is effective only if it is complemented by a search mechanism that lets users retrieve the desired content. This need has driven extensive research on Optical Character Recognition (OCR) systems, which convert document images into text and thereby enable search and retrieval. In some cases, however, text recognition is very challenging because of the complexity of the script, and OCR systems fail. This work presents an indexing- and retrieval-based word spotting system for digitized English documents. A document image containing English text is segmented into ligatures, and each ligature is represented by a set of features extracted using the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT). The ligatures are then clustered so that similar ligatures fall into the same cluster. An index file is maintained for each cluster, storing all occurrences (locations) of its ligatures in the given document.
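
The pipeline described above (features, clustering, per-cluster index, lookup) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each ligature is reduced to a hypothetical 1-D profile of even length, the wavelet is a single-level Haar transform, clustering is a simple greedy distance-threshold scheme, and the names `features`, `build_index`, and `spot` are invented here. The abstract does not state which wavelet, clustering algorithm, or feature layout the system actually uses.

```python
import math
from collections import defaultdict


def dct2_1d(signal):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def haar_dwt_1d(signal):
    """Single-level Haar DWT; returns (approximation, detail). Length must be even."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def features(profile, n_dct=4):
    """Feature vector: first n_dct DCT coefficients plus the Haar approximation band."""
    approx, _detail = haar_dwt_1d(profile)
    return dct2_1d(profile)[:n_dct] + approx


def build_index(ligatures, threshold=1.0):
    """Cluster ligatures greedily and record where each cluster occurs.

    ligatures: iterable of (location, profile) pairs (assumed input format).
    Returns (centers, index), where index maps cluster id -> list of locations.
    """
    centers = []
    index = defaultdict(list)
    for location, profile in ligatures:
        f = features(profile)
        for cid, center in enumerate(centers):
            if math.dist(f, center) <= threshold:
                index[cid].append(location)
                break
        else:
            # No existing cluster is close enough: start a new one.
            centers.append(f)
            index[len(centers) - 1].append(location)
    return centers, dict(index)


def spot(query_profile, centers, index, threshold=1.0):
    """Return the indexed locations of the cluster nearest to the query, if close enough."""
    f = features(query_profile)
    best = min(range(len(centers)), key=lambda cid: math.dist(f, centers[cid]), default=None)
    if best is None or math.dist(f, centers[best]) > threshold:
        return []
    return index[best]
```

In this sketch a query ligature is matched against cluster centers rather than against every ligature in the document, which is the point of keeping a per-cluster index: lookup cost depends on the number of clusters, not on the number of occurrences.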
