Improving recognition accuracy on structured documents by learning structural patterns

A Lorincz, T Marcinkovics, Gy Hévízi

doi:10.1007/s10044-004-0208-3

https://doi.org/10.1007/s10044-004-0208-3

Abstract

In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and complement pre-filtering information extraction procedures to reduce uncertainties. To this end, a probability distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection making use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.
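
The idea of a learned structure distribution can be illustrated with a minimal Python sketch. This is not the method of the paper, whose parameterisation is not given in the abstract. As an assumption made purely for illustration, each document is reduced to its parent/child tag pairs, and each class gets a Laplace-smoothed distribution over those pairs. The names `parent_child_pairs`, `StructureModel`, `alpha` and `threshold` are hypothetical. Classification picks the class with the highest average log-likelihood, and a document is flagged as novel when no class explains its structure better than a chosen threshold.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def parent_child_pairs(xml_text):
    """Return the (parent tag, child tag) pairs of an XML document.

    Assumption for this sketch: structure is summarised by these pairs only;
    text content and attributes are ignored.
    """
    root = ET.fromstring(xml_text)
    pairs = [("<root>", root.tag)]  # pseudo-parent for the document element
    for parent in root.iter():
        for child in parent:
            pairs.append((parent.tag, child.tag))
    return pairs


class StructureModel:
    """Hypothetical per-class distribution over tag pairs (Laplace-smoothed)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # smoothing constant (assumed value)
        self.counts = defaultdict(Counter)  # class label -> tag-pair counts
        self.vocab = set()                  # all tag pairs seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            pairs = parent_child_pairs(text)
            self.counts[label].update(pairs)
            self.vocab.update(pairs)
        return self

    def avg_log_likelihood(self, xml_text, label):
        """Average log-probability per tag pair under one class."""
        pairs = parent_child_pairs(xml_text)
        total = sum(self.counts[label].values())
        # One extra slot in the denominator reserves mass for unseen pairs.
        denom = total + self.alpha * (len(self.vocab) + 1)
        log_lik = sum(
            math.log((self.counts[label][p] + self.alpha) / denom) for p in pairs
        )
        return log_lik / len(pairs)

    def predict(self, xml_text):
        """Return the class whose structure distribution fits best."""
        return max(self.counts, key=lambda c: self.avg_log_likelihood(xml_text, c))

    def is_novel(self, xml_text, threshold):
        """Flag a document whose best class score falls below the threshold."""
        best = max(self.avg_log_likelihood(xml_text, c) for c in self.counts)
        return best < threshold
```

A model is trained with `StructureModel().fit(docs, labels)`, where `docs` are XML strings. The novelty threshold has no principled default in this sketch and would have to be tuned on held-out documents.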

Full Text