Abstract

In this paper, a new representation of Farsi words is proposed to address the keyword spotting problem in Farsi document image retrieval. We define a signature for each Farsi word based on the layout of its connected components. The signature is represented as boxes, and by drawing vertical and horizontal lines we construct a grid over each word that serves as a new descriptor. One advantage of this method is that it can be used for both handwritten and machine-printed texts. To evaluate the system against other methods, we test it on a database of 19,582 printed Farsi words, obtaining a recall rate of 98.1% and a precision rate of 94.3%.
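
To make the box-and-grid idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binarized word image given as a list of rows of 0/1 pixels, 8-connectivity, and a descriptor that simply marks which grid cells are covered by a component box; the function names `component_boxes` and `grid_descriptor` are hypothetical and do not appear in the paper.

```python
from collections import deque


def component_boxes(image):
    """Return inclusive bounding boxes (top, left, bottom, right) of the
    8-connected foreground components, in raster-scan order of discovery."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one component, tracking its extent.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def grid_descriptor(boxes):
    """Draw a horizontal and a vertical line at every box edge and return a
    0/1 matrix marking the grid cells that lie inside some box.
    (Assumed descriptor; the paper's exact encoding may differ.)"""
    ys = sorted({b[0] for b in boxes} | {b[2] + 1 for b in boxes})
    xs = sorted({b[1] for b in boxes} | {b[3] + 1 for b in boxes})
    grid = []
    for y0, y1 in zip(ys, ys[1:]):
        row = []
        for x0, x1 in zip(xs, xs[1:]):
            covered = any(
                t <= y0 and y1 - 1 <= b and l <= x0 and x1 - 1 <= r
                for t, l, b, r in boxes
            )
            row.append(1 if covered else 0)
        grid.append(row)
    return grid


# Toy "word": a dot above a main body, plus a separate stroke on the right.
image = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
]
boxes = component_boxes(image)
descriptor = grid_descriptor(boxes)
print(boxes)
print(descriptor)
```

Because the grid lines follow the box edges rather than fixed pixel positions, the resulting matrix depends on the relative layout of the components and not on their absolute size, which is one plausible reason such a signature could transfer between fonts or writing styles.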

Highlights

  • Due to the increase in digital libraries and paper documents in offices, their organization and management take significant amounts of time and energy

  • To search for a keyword in document images, optical character recognition (OCR) must first convert the images from pictorial format to a machine-readable text format [1]; the target word is then sought in the text using traditional document retrieval methods

  • Although OCR is frequently used by researchers in this area, it has disadvantages that make it unsuitable for some retrieval cases

Summary

Introduction

Due to the increase in digital libraries and paper documents in offices, their organization and management take significant amounts of time and energy. The upper contours of words are extracted and a picture dictionary of these features is made, and each subword is represented as a combination of contour strokes that includes upper, lower, and middle positions relative to the baseline. As another example, the work proposed in [23] relies on shape features of printed words to recognize Arabic texts written in three different fonts, two of which are synthetic. Based on the literature review above and considering the method discussed in [2, 3], in this paper we propose a new model for machine-printed Farsi text retrieval based on the similarity of component layouts in Farsi words. The remainder of this paper is organized as follows: Section 2 describes our proposed method, Section 3 summarizes the experimental results, and Section 4 presents the conclusions of this paper.

Preprocessing
Experimental results
Conclusion