An experimental evaluation of OCR text representations for learning document classifiers

Markus Junker, Rainer Hoch

doi:10.1007/s100320050012


Journal: International Journal on Document Analysis and Recognition
Publication Date: Jul 1, 1998
Citations: 35

Affiliation: German Research Centre for Artificial Intelligence, Systems, Applications & Products in Data Processing (Germany)

#OCR Texts #Use Of N-grams


Abstract

In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
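As a rough illustration of why character n-grams can be attractive for OCR text, the sketch below compares a word-based and a trigram-based representation of a short phrase and of a misrecognized copy of it. It is not the authors' implementation: the abstract does not specify the seven representations, and the function names, the choice of n = 3, and the sample OCR error are all assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (assumed scheme, n=3 by default)."""
    # Lowercase and collapse whitespace so n-grams can span word boundaries consistently.
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def word_tokens(text):
    """Return lowercase whitespace-separated words (no morphological analysis)."""
    return text.lower().split()


# Hypothetical example: the OCR output misreads "i" as "1" in one word.
clean = "invoice number"
ocr = "invo1ce number"

shared_words = set(word_tokens(clean)) & set(word_tokens(ocr))
shared_trigrams = set(char_ngrams(clean)) & set(char_ngrams(ocr))

print(sorted(shared_words))
print(sorted(shared_trigrams))
```

In this example a single character error removes the word "invoice" from the word-based representation entirely, while most of the trigrams around the error still match, so the two texts remain similar in the n-gram feature space.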
