Abstract

Grammatical descriptions of languages of the world form a sub-genre of scholarly documents in the field of linguistics. A document of this genre may be modeled as a concatenation of a table of contents, sociolinguistic description, phonological description, morphosyntactic description, comparative remarks, lexicon, text, bibliography and index, of which only the morphosyntactic description is mandatory. Separating these parts is useful for information extraction, bibliometrics and information content analysis. Using a collection of over 10 000 digitized grammatical descriptions and an associated bibliography with document-level categorizations, we show that standard techniques from text classification can be adapted to classify individual pages. Assuming that the divisions of interest form contiguous page ranges, we can achieve the sought-after division in a transparent way. In contrast to previous work on similar tasks in other domains, no use is made of formatting cues, no additional annotated data is needed, high-quality OCR is not required, and the document collection is highly multilingual.
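
The abstract does not say how the contiguous page ranges are computed from the page-level classifications, so the following is only a minimal sketch of one possible approach, not the authors' method. It assumes that some page classifier has already produced one score per page and section type (for example, log-probabilities), and that sections appear in the order listed above. The names `SECTIONS` and `segment_pages` are hypothetical. A dynamic program then picks the non-decreasing sequence of section labels with the highest total score, which allows any section to be skipped and yields contiguous ranges by construction.

```python
from typing import List, Sequence

# Assumed section order, taken from the listing in the abstract.
SECTIONS = [
    "table of contents", "sociolinguistic", "phonological",
    "morphosyntactic", "comparative", "lexicon", "text",
    "bibliography", "index",
]


def segment_pages(scores: Sequence[Sequence[float]]) -> List[int]:
    """Assign one section index per page.

    scores[i][k] is the (assumed) classifier score of page i for section k.
    The returned labels never decrease from one page to the next, so each
    section covers a contiguous page range, and the summed score is maximal
    among all such labelings.
    """
    if not scores:
        return []
    n_labels = len(scores[0])
    best = [list(scores[0])]  # best[i][k]: best total with page i labeled k
    back = []                 # back[i-1][k]: label chosen for page i-1
    for page in scores[1:]:
        prev = best[-1]
        row, ptr = [], []
        arg = 0  # index of the running maximum of prev[0..k]
        for k in range(n_labels):
            if prev[k] > prev[arg]:
                arg = k
            row.append(page[k] + prev[arg])
            ptr.append(arg)
        best.append(row)
        back.append(ptr)
    # Trace the best path backwards from the last page.
    k = max(range(n_labels), key=lambda j: best[-1][j])
    labels = [k]
    for ptr in reversed(back):
        k = ptr[k]
        labels.append(k)
    labels.reverse()
    return labels
```

With three section types and five pages, a single page whose score favors an earlier section is absorbed into the surrounding range instead of splitting it:

```python
toy_scores = [
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],   # noisy page: looks like section 0 in isolation
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
print(segment_pages(toy_scores))  # [0, 1, 1, 1, 2]
```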
