Document image dataset indexing and compression using connected components clustering

Houssem Chatbri, Keisuke Kameyama

doi:10.1109/mva.2015.7153182

Abstract

We present a method for indexing and compressing document image datasets by clustering connected components. Our method extracts connected components from each image in the dataset and clusters them to build a hash table that serves as a compressed index of the dataset. Clustering is based on component similarity, which is estimated by comparing shape features extracted from the components. The hash table is then saved in a text file, which is further compressed using any available compression method. Component encoding in the hash table is storage efficient: each component is stored using its contour points and a reduced number of interior points that are sufficient for its reconstruction. We evaluate our method's indexing and compression performance using four document image datasets. Experimental results show that indexing significantly improves efficiency when used in document image retrieval. In addition, comparative evaluation with two compression standards, namely the ZIP and XZ formats, shows competitive performance. Our compression rates are below 20%, and the compression errors are very low, on the order of 10^-6% per image.
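To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It rests on several simplifying assumptions: images are small binary grids given as lists of lists; two components are treated as similar only if their pixel sets are identical after translation, which stands in for the shape-feature comparison of the paper; and each cluster stores all of its pixels rather than contour points plus a reduced set of interior points. The function names (connected_components, normalize, build_index, serialize) and the JSON text layout are hypothetical.

```python
import json
import lzma
from collections import deque


def connected_components(image):
    """Return 8-connected foreground components as lists of (row, col) pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def normalize(pixels):
    """Translate a component to the origin; return (offset, shape key)."""
    top = min(y for y, _ in pixels)
    left = min(x for _, x in pixels)
    shape = tuple(sorted((y - top, x - left) for y, x in pixels))
    return (top, left), shape


def build_index(images):
    """Map each distinct shape to the (image id, offset) pairs where it occurs.

    Assumption: exact shape equality replaces the similarity-based clustering.
    """
    index = {}
    for image_id, image in enumerate(images):
        for pixels in connected_components(image):
            offset, shape = normalize(pixels)
            index.setdefault(shape, []).append((image_id, offset))
    return index


def serialize(index):
    """Write the table as text (JSON here), then compress it with XZ."""
    table = [{"shape": list(shape), "occurrences": occ}
             for shape, occ in index.items()]
    return lzma.compress(json.dumps(table).encode("utf-8"))


# Two toy images that share one L-shaped component at different positions.
img_a = [[1, 0, 0, 0],
         [1, 1, 0, 1],
         [0, 0, 0, 0]]
img_b = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 1, 1, 0]]

index = build_index([img_a, img_b])
blob = serialize(index)
```

In this sketch the shared L shape is stored once and referenced from both images, which is the source of both the indexing benefit (retrieval can compare against cluster representatives instead of every component) and the compression benefit. A similarity threshold on shape features would merge near-duplicates as well, at the cost of a small reconstruction error.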

Full Text