Abstract

Document Processing Systems (DPSs) help office workers manage information, and document classification is one of their major functions. In this paper we present a sample-based approach to document classification that analyzes both the layout structure and the conceptual structure of a document. We represent a document's layout structure as an ordered labeled tree, obtained through a procedure known as nested segmentation, and its conceptual structure as a set of attribute-type pairs. The layout similarity between the document to be classified and each sample document is determined with a previously developed approximate tree matching toolkit. The conceptual similarity between the documents is determined by analyzing their contents and calculating their degree of conceptual closeness. The document type is then identified by combining the layout and conceptual similarities between the document to be classified and the samples in the document sample base. Experimental results are presented that demonstrate the effectiveness of the proposed techniques.
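
As a rough illustration of the sample-based idea described above, the following Python sketch scores a document against labeled samples using both a layout tree and a set of attribute-type pairs. It is not the paper's implementation. The names (Node, layout_similarity, conceptual_similarity, classify), the simple top-down tree alignment used in place of the approximate tree matching toolkit, the Jaccard overlap used for conceptual closeness, and the weighted sum with weight w are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of an ordered labeled tree, e.g. one block of a page layout."""
    label: str
    children: list[Node] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def matched(a: Node, b: Node) -> int:
    """Count nodes matched by a simple top-down alignment (an assumption,
    standing in for a real approximate tree matcher): roots must share a
    label, and children are aligned in order with an LCS-style table."""
    if a.label != b.label:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = matched(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return 1 + dp[m][n]


def layout_similarity(a: Node, b: Node) -> float:
    """Matched nodes relative to total nodes, in the range [0, 1]."""
    return 2 * matched(a, b) / (size(a) + size(b))


def conceptual_similarity(a: set, b: set) -> float:
    """Jaccard overlap of two sets of (attribute, type) pairs (an assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(tree: Node, pairs: set, samples: list, w: float = 0.5) -> str:
    """Return the type of the best-scoring sample.
    samples holds (type_name, layout_tree, attribute_type_pairs) tuples;
    w is a hypothetical weight balancing layout against concepts."""
    def score(sample):
        _, s_tree, s_pairs = sample
        return (w * layout_similarity(tree, s_tree)
                + (1 - w) * conceptual_similarity(pairs, s_pairs))
    return max(samples, key=score)[0]
```

In this sketch a sample base is just a list of tuples, so adding a new document type means adding one more labeled sample rather than writing new rules. How the two similarities should actually be combined is a design choice that the abstract does not specify.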
