Abstract

Indexing a Chinese text document for retrieval is straightforward: the text only needs to be segmented into phrases. When the document is a Chinese document image (an image rather than an encoded text file), the usual approach is to first convert the image into a text file using Chinese optical character recognition (OCR) and then index the result with an information retrieval algorithm. However, OCR is time-consuming, which reduces retrieval efficiency. This paper proposes an indexing method based on the stroke density code. The document image is first segmented into individual Chinese character images, the stroke density of each character image is then calculated, and finally the stroke density code of each character image is obtained. The indexing method is fast and robust to noise. This paper also presents a retrieval method for Chinese document images built on this indexing technique, and discusses how the indexing and retrieval methods apply to duplicate detection. We validate the indexing method by applying it to keyword spotting and duplicate detection.
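The abstract does not define how the stroke density code is computed, so the following Python sketch is only an illustration of the general idea under stated assumptions. It assumes each segmented character image is a binary grid (1 for a stroke pixel), splits it into a small grid of zones, takes the fraction of stroke pixels in each zone as that zone's density, and quantizes each density to a few levels to form the code. The zoning scheme, the number of levels, and the names `stroke_density_code`, `build_index`, and `spot_keyword` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def stroke_density_code(char_img, grid=2, levels=4):
    """Quantized stroke density per zone of a binary character image.

    char_img: list of equal-length rows, 1 = stroke (ink) pixel, 0 = background.
    The grid x grid zoning and the `levels` quantization are assumptions
    made for illustration, not the paper's definition.
    """
    height, width = len(char_img), len(char_img[0])
    code = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            ink = sum(char_img[y][x] for y in range(y0, y1) for x in range(x0, x1))
            area = max((y1 - y0) * (x1 - x0), 1)  # guard against empty zones
            density = ink / area
            # Map density in [0, 1] to an integer level in [0, levels - 1].
            code.append(min(int(density * levels), levels - 1))
    return tuple(code)


def build_index(documents):
    """Map each code to the (doc_id, position) pairs where it occurs.

    documents: dict of doc_id -> list of character images in reading order.
    """
    index = defaultdict(list)
    for doc_id, char_images in documents.items():
        for pos, img in enumerate(char_images):
            index[stroke_density_code(img)].append((doc_id, pos))
    return index


def spot_keyword(index, keyword_images):
    """Return (doc_id, start) pairs where the keyword's codes occur consecutively."""
    codes = [stroke_density_code(img) for img in keyword_images]
    if not codes:
        return []
    hits = set(index.get(codes[0], []))
    for offset, code in enumerate(codes[1:], start=1):
        following = set(index.get(code, []))
        hits = {(d, p) for (d, p) in hits if (d, p + offset) in following}
    return sorted(hits)
```

In this sketch no character is ever recognized: lookup works purely on the codes, which is the property that lets an index of this kind avoid the cost of OCR. Because each density is quantized, a few flipped pixels often leave the code unchanged, although a density close to a level boundary can still shift. An exact-match lookup like this one is the simplest choice; a tolerant comparison between codes would be a natural variation.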
