EXTENDED ABSTRACT:The paper presents three language resources enabling better full-text access to digitised printed historical Slovenian texts: a hand-annotated corpus, a hand-annotated lexicon of historical words and a collection of transcribed texts. The aim of the resources is twofold: on one hand they support empirical linguistic research (corpus, collection) and represent a reference tool for the research of historical Slovenian (lexicon) while on the other hand they may serve as training data for the development of Human Language Technologies enabling better full-text search in digital libraries containing Slovenian written cultural heritage, modernisation of historical texts, and the development of better technological solutions for text recognition and scanning. The hand annotated corpus of historical Slovenian contains the text from 1,000 pages sampled from the years 1750 to 1900, two texts date to the end of the 16th or 17th century. The corpus contains a little more than 250,000 word tokens; each of them being annotated with hand validated linguistic features: modernised form, lemma or base form, and morhpo-syntactic description. Thus the word token »ajfram« is annotated with the normalised form »ajfrom«, by the lemma »ajfer« and morphosyntactic description »Som« or »Samostalnik« (noun), »občni« (common), »moški« (masculine) and a modernised form »gorečnost« (fervour). At first the corpus was annotated automatically and then manually verified and corrected. The lexicon was created automatically from the hand-annotated corpus. It contains only attested word-forms and examples of use. The word-forms are ordered under their modern equivalents. All the modern forms of a particular word constitute a dictionary entry, defined by its lemma with conjoint information i.e. the morpho-syntactic description and the closest contemporary synonyms. Thus the entry »ajfrer/Som/gorečnost« is annotated by two modernised words »ajfra « and »ajfrom« and their archaic forms »ajfram« and »aifram« and by attestattion: »…shaz noi frihtei tu shebranje karbo sdei udrukono is velzhim aifram noi is flisam inu is andohtjo 3 vezhiere saporedama …« (Tapravi inu tazieli Colemone-Shegen, 1800, p. 183). At present, the lexicon contains over 25,000 entries (including modern words in archaic texts), 50,000 word-forms and 70,000 archaic forms. The third resource is represented by an extensive collection of digitised texts similar to the corpus. The difference is that the words are annotated automatically by a tool developed to process historical Slovenian text named ToTrTaLe. The tool implements a pipeline, where it first tokenises the text and then attempts to transcribe the archaic words to their modern day equivalents. Then, the text is tagged and lemmatised using the models for modern Slovenian language. It contains about 5 million words of hand-corrected transcriptions from the following digitised texts: • Slovenian books and editions of the newspaper »Kmetijske in rokodelske novice«, digitised by the National University Library (NUK) in the frame of the EU project IMPACT (5000 pages); • Digital library AHLib,1 comprising Slovenian books translated from German (100 books); • A selection of Slovenian books2 All three resources (corpus, lexicon, collection) are encoded according to the Text Encoding Initiative Guidelines TEI P5, which enable the definition of XML schemas for encoding texts for scholarly purposes. The home page of the project at http://nl.ijs.si/imp/ enables access to the resources. The collection and the lexicon are available for on-line browsing, the corpus and the automatically annotated collection for linguistics searches via a concordancer, while all the resources can be also downloaded in their source XML form under the Creative Commons Attribution Licence. In future we expect to extend the resources, however, even their present scope is sufficient for corpus based diachronic studies of historical Slovenian language and for developing useful language technology tools for processing cultural heritage texts.
Read full abstract