Recognition-free search in graphics stream of PDF

A R Balasubramanian, C V Jawahar

doi:10.3233/wdl-120016

Abstract

Digital libraries are becoming an integral part of our day-to-day life. Digitized books and manuscripts in many of these libraries are often stored as images or graphics. Very often, they cannot be searched at the content level due to the lack of robust character recognizers. PDF (Portable Document Format) has emerged as one of the most popular document representation schemes in digital libraries, especially for storing scanned documents. When no textual (Unicode, ASCII) representation is available, the scanned images are stored in the graphics stream of the PDF. In this paper, we describe a solution for searching, at the content level, the textual data held in the graphics stream of PDF files. The proposed solution is demonstrated by enhancing an open source PDF viewer (Xpdf). Indian language support is also provided. Users can type a word in Roman script (ITRANS), view it in a font, and simultaneously search both the textual and graphics streams of the PDF.
