Visual information retrieval from historical document images

Sara Zhalehpour, Andrew Piper, Ehsan Arabnejad, Chad Wellmon, Mohamed Cheriet

doi:10.1016/j.culher.2019.05.018

Abstract

In recent decades, preserving and publicizing historical documents in digital format has received considerable attention. Although modern digitization techniques have largely solved the problem of protecting and accessing these documents, visual information retrieval and interpretation remains an arduous task. This is due to the complex and unusual structures of historical documents as well as their degraded nature. Information retrieval from historical documents requires an appropriate approach to characterize the document content in a coherent way. Printed documents contain not only text characters and their formatting but also associated typographical elements. Finding and tracking the visual typographical objects that shape the content of historical documents helps us retrieve and convey more information about the various methods of representing these documents. These elements can be footnotes, which connect the authority and demonstrate the relationship between manuscripts and sources, or tables, which summarize different sorts of information in geometric form. This research focuses on the problem of detecting footnotes and tables in historical documents and establishes a framework for each of these objectives. These frameworks must efficiently handle the complex structures of historical documents and at the same time possess the generalization power to be applied to large-scale document image collections. To the best of our knowledge, footnote detection has rarely been addressed in the literature to date. Therefore, our first goal is to present a novel framework for footnote-based document image classification in historical documents. The basic idea behind this framework is to use the most prominent visual features of a footnote to create a feature vector.
The three most notable visual features of a footnote on a page are the smaller font size of the footnote with respect to the body text, the footnote's location at the bottom of the page, and the relatively greater gap between the footnote and the body text compared to the standard line spacing. Three methods are proposed, one for each of these observations. We define a set of rules using these observations to create our final feature vector. Our framework for footnote-based document image classification in historical documents is completed by feeding these feature vectors to a support vector machine (SVM) classifier. The proposed framework is applied to more than 32 million images from the 18th century. The evaluation results demonstrate the efficiency, generalization power, and robustness of our framework for detecting pages containing footnotes regardless of their layout and structure type. The state-of-the-art methods for table detection in documents mostly use markup documents (e.g., PDF, HTML) and do not cover all types of tables within one framework. However, for historical documents, which are the main target of this thesis, we only have access to the scanned image and need to deal with all types of tables at the same time. The proposed framework is based on the hypothesis that text in tables occurs in a harmonic, column-wise manner. This hypothesis suggests using a spectral method to develop our framework. We propose an approach based on Mel frequency cepstral coefficients (MFCCs) to classify document images according to the presence or absence of tables on the page. MFCCs are well-known speech processing features that emphasize lower frequency components over higher ones. An SVM classifier is used as the final step of our framework for detecting pages containing tables.
We test the introduced framework on our datasets, and the results confirm the efficiency of the proposed method in comparison to both a state-of-the-art method and our benchmark dataset of 18th-century printed documents.
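As a rough illustration of the three footnote observations described in the abstract (smaller font size, position near the bottom of the page, and a larger gap before the footnote), the following Python sketch turns them into a three-element feature vector. It is not the authors' implementation. It assumes text lines have already been segmented into (top, bottom) pixel coordinates, and the function name `footnote_features`, the rule of splitting the page at the largest inter-line gap, and the exact feature definitions are all hypothetical choices made for this example.

```python
from statistics import median
from typing import List, Tuple

# One text line as (top, bottom) y-coordinates in pixels (assumed input format).
Line = Tuple[float, float]


def footnote_features(lines: List[Line], page_height: float) -> List[float]:
    """Return [size_ratio, position, gap_ratio] for a candidate footnote block.

    Illustrative assumption: the candidate footnote is every line below the
    largest vertical gap between consecutive text lines.
    """
    if len(lines) < 3 or page_height <= 0:
        raise ValueError("need at least three text lines and a positive page height")

    ordered = sorted(lines)  # top-to-bottom order
    # Vertical white space between each pair of consecutive lines.
    gaps = [ordered[i + 1][0] - ordered[i][1] for i in range(len(ordered) - 1)]
    split = max(range(len(gaps)), key=lambda i: gaps[i])
    body = ordered[: split + 1]
    candidate = ordered[split + 1:]

    body_height = median([bottom - top for top, bottom in body])
    candidate_height = median([bottom - top for top, bottom in candidate])

    # Observation 1: footnote text is smaller than body text (ratio below 1).
    size_ratio = candidate_height / body_height
    # Observation 2: the footnote sits near the bottom of the page (close to 1).
    position = candidate[0][0] / page_height
    # Observation 3: the gap before the footnote exceeds normal line spacing.
    typical_gap = median(gaps[:split] + gaps[split + 1:])
    gap_ratio = gaps[split] / typical_gap if typical_gap > 0 else float("inf")

    return [size_ratio, position, gap_ratio]


if __name__ == "__main__":
    # Four body lines followed by two smaller lines after a wider gap.
    page = [(100, 120), (130, 150), (160, 180), (190, 210), (250, 264), (270, 284)]
    print(footnote_features(page, page_height=300))
```

A vector of this kind could then be passed to an SVM classifier, as the abstract describes. The rules that produce the feature vector in the paper itself are not given in the abstract and may differ from this sketch.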

Full Text