Word spotting using clustering on extracted DCT and DWT features

Hafiz Adnan Niaz, Usman Akram, Usman Akbar

doi:10.1109/iceet1.2018.8338629

Abstract

Digitization of documents and books is only effective if it is complemented by a search mechanism that allows users to retrieve the desired content. This has led to extensive research on Optical Character Recognition (OCR) systems, which convert document images into text and thereby enable search and retrieval. In some cases, text recognition is very challenging because of the complexity of the script, and OCR systems fail. This work presents an indexing- and retrieval-based word spotting system for digitized English documents. The document image containing English text is segmented into ligatures, and each ligature is represented by a set of features. Features are extracted using the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT). The ligatures are then clustered so that similar ligatures are grouped into the same cluster. An index file is maintained for each cluster, storing all occurrences (locations) of its ligatures in the given document.
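
The pipeline described in the abstract (feature extraction, clustering, per-cluster index) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the wavelet, or the number of coefficients kept, so the sketch assumes a simple leader clustering with a distance threshold, a one-level Haar wavelet, and the 4 x 4 low-frequency DCT block. All function names, the toy patches, and the threshold value are hypothetical, and segmentation into ligatures is assumed to have been done already.

```python
import math
from collections import defaultdict


def dct2(patch, keep=4):
    """Naive, unnormalized 2-D DCT-II; returns the keep x keep low-frequency block."""
    n, m = len(patch), len(patch[0])
    feats = []
    for u in range(keep):
        for v in range(keep):
            s = 0.0
            for x in range(n):
                for y in range(m):
                    s += (patch[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * m)))
            feats.append(s)
    return feats


def haar_dwt2(patch):
    """One-level 2-D Haar DWT; returns the approximation (LL) sub-band, flattened."""
    n, m = len(patch), len(patch[0])
    return [(patch[x][y] + patch[x][y + 1]
             + patch[x + 1][y] + patch[x + 1][y + 1]) / 2.0
            for x in range(0, n - 1, 2) for y in range(0, m - 1, 2)]


def features(patch):
    """Concatenate DCT and DWT coefficients into one feature vector."""
    return dct2(patch) + haar_dwt2(patch)


def build_index(ligatures, threshold=1.0):
    """Leader clustering (an assumption, not the paper's stated method).

    ligatures: list of (location, patch) pairs.
    Returns (leaders, index), where index maps cluster id -> list of locations.
    """
    leaders = []
    index = defaultdict(list)
    for loc, patch in ligatures:
        f = features(patch)
        for cid, leader in enumerate(leaders):
            if math.dist(f, leader) <= threshold:
                break  # joins the first cluster whose leader is close enough
        else:
            cid = len(leaders)  # no match: this ligature starts a new cluster
            leaders.append(f)
        index[cid].append(loc)
    return leaders, dict(index)


def spot(query_patch, leaders, index, threshold=1.0):
    """Return stored locations of the cluster nearest to the query, if close enough."""
    f = features(query_patch)
    if not leaders:
        return []
    cid = min(range(len(leaders)), key=lambda i: math.dist(f, leaders[i]))
    return index[cid] if math.dist(f, leaders[cid]) <= threshold else []


# Toy 4 x 4 binary "ligatures": a vertical bar and a horizontal bar.
vertical = [[0, 1, 1, 0] for _ in range(4)]
horizontal = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]]

# Locations are hypothetical (line, position) pairs.
ligatures = [((0, 0), vertical), ((0, 5), horizontal), ((1, 2), vertical)]
leaders, index = build_index(ligatures)
hits = spot(vertical, leaders, index)
```

In this toy case the two shapes have identical Haar approximation coefficients and are separated only by their DCT coefficients, which suggests why combining the two feature types can be useful. A query is answered by looking up the index of the nearest cluster, so no OCR step is needed at retrieval time.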

Full Text