HEURISTIC CLASSIFICATION OF OFFICE DOCUMENTS

Xiaolong Hao, Peter A. Ng, Jason T. L. Wang, Michael P. Bieber

doi:10.1142/s0218213094000121

Abstract

Document Processing Systems (DPSs) help office workers manage information, and document classification is one of their major functions. This paper presents a sample-based approach to document classification that analyzes a document’s layout structure and conceptual structure. The layout structure is represented by an ordered labeled tree, obtained through a procedure known as nested segmentation, and the conceptual structure is represented by a set of attribute-type pairs. The layout similarity between the document to be classified and each sample document is determined by a previously developed approximate tree matching toolkit. The conceptual similarity is determined by analyzing the documents’ contents and calculating their degree of conceptual closeness. The document type is then identified by computing both the layout and conceptual similarities between the document to be classified and the samples in the document sample base. Experimental results are presented that demonstrate the effectiveness of the proposed techniques.
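
As a rough illustration of the sample-based scheme described in the abstract, the following Python sketch classifies a document by combining a layout similarity and a conceptual similarity against a small sample base. It is not the authors' implementation: the approximate tree matching toolkit is replaced here by a simple top-down ordered tree edit distance, the degree of conceptual closeness is approximated by Jaccard overlap of attribute-type pairs, and the equal weighting of the two scores is an assumption. All names (Node, tree_distance, layout_similarity, conceptual_similarity, classify) and the sample data are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """One block in an ordered labeled layout tree (hypothetical structure)."""
    label: str
    children: list[Node] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down ordered tree edit distance (a stand-in, not the paper's toolkit).

    Roots are relabeled at cost 1 if they differ; child sequences are aligned
    by dynamic programming, where inserting or deleting a whole subtree costs
    its node count.
    """
    cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree of a
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree of b
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return cost + d[m][n]


def layout_similarity(a: Node, b: Node) -> float:
    """Map the distance into (0, 1]; identical trees score 1.0."""
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


def conceptual_similarity(p: set, q: set) -> float:
    """Jaccard overlap of (attribute, type) pairs (an assumed closeness measure)."""
    if not p and not q:
        return 1.0
    return len(p & q) / len(p | q)


def classify(doc_tree, doc_pairs, samples, weight=0.5):
    """Return (type, score) of the best-matching sample.

    `weight` balances layout against conceptual similarity; 0.5 is an assumption.
    """
    best_type, best_score = None, -1.0
    for doc_type, tree, pairs in samples:
        score = (weight * layout_similarity(doc_tree, tree)
                 + (1 - weight) * conceptual_similarity(doc_pairs, pairs))
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score


# Hypothetical sample base: (document type, layout tree, attribute-type pairs).
memo = Node("page", [Node("header"), Node("body", [Node("para"), Node("para")])])
letter = Node("page", [Node("header"), Node("body", [Node("para")]), Node("signature")])
samples = [
    ("memo", memo, {("sender", "name"), ("date", "date"), ("subject", "text")}),
    ("letter", letter, {("sender", "name"), ("receiver", "name"),
                        ("date", "date"), ("signature", "name")}),
]

# Hypothetical document to classify.
doc = Node("page", [Node("header"),
                    Node("body", [Node("para"), Node("para"), Node("para")])])
doc_pairs = {("sender", "name"), ("date", "date"), ("subject", "text")}

predicted_type, score = classify(doc, doc_pairs, samples)
print(predicted_type, round(score, 3))
```

In this sketch the document is assigned the type of its single closest sample. Other decision rules, such as thresholds on each similarity or voting over several samples, would fit the same structure.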

Full Text