Content-based indexing and retrieval method of Chinese document images

Yaodong He Yaodong He,Zao Jiang Zao Jiang,Hong Zhao Hong Zhao,Bing Liu Bing Liu

doi:10.1109/icdar.1999.791880

Abstract

In Chinese information retrieval, it is easy to index a Chinese text document for retrieval. We just need to segment the text document into phrases. When the document is a Chinese document image (non-ASCII file), we may first convert the document image into the text file by using Chinese optical character recognition (OCR) technology and then index the document by using an information retrieval algorithm. However, OCR needs more time, which can influence retrieval efficiency. This paper proposes an index method based on stroke density code. First segment the document image to get all the Chinese character images, then calculate the stroke density of each Chinese character image, and at last attain the stroke density code of the character image. The index method has the advantage of speed and robustness to noise. In addition, this paper also offers a retrieval method for Chinese document images based on the index technology. We discuss the index and retrieval method for duplicate detection. We have proved the validity of the index method through its application to keyword spotting and duplicate detection.

Full Text