Historical Slovenian Language Resources

Tomaž Erjavec

doi:10.55741/knj.56.3.14316

Abstract

EXTENDED ABSTRACT: The paper presents three language resources enabling better full-text access to digitised printed historical Slovenian texts: a hand-annotated corpus, a hand-annotated lexicon of historical words and a collection of transcribed texts. The aim of the resources is twofold. On the one hand, they support empirical linguistic research (corpus, collection) and serve as a reference tool for research on historical Slovenian (lexicon). On the other hand, they may serve as training data for the development of Human Language Technologies enabling better full-text search in digital libraries containing Slovenian written cultural heritage, modernisation of historical texts, and the development of better technological solutions for text recognition and scanning. The hand-annotated corpus of historical Slovenian contains the text of 1,000 pages sampled from the years 1750 to 1900, with two texts dating to the end of the 16th or 17th century. The corpus contains a little more than 250,000 word tokens, each annotated with hand-validated linguistic features: modernised form, lemma or base form, and morphosyntactic description. For example, the word token »ajfram« is annotated with the normalised form »ajfrom«, the lemma »ajfer«, the morphosyntactic description »Som«, that is »Samostalnik« (noun), »občni« (common), »moški« (masculine), and the modernised form »gorečnost« (fervour). The corpus was first annotated automatically and then manually verified and corrected. The lexicon was created automatically from the hand-annotated corpus. It contains only attested word-forms and examples of use. The word-forms are ordered under their modern equivalents. All the modern forms of a particular word constitute a dictionary entry, defined by its lemma together with its associated information, i.e. the morphosyntactic description and the closest contemporary synonyms.
Thus the entry »ajfrer/Som/gorečnost« lists two modernised words, »ajfra« and »ajfrom«, their archaic forms »ajfram« and »aifram«, and an attestation: »…shaz noi frihtei tu shebranje karbo sdei udrukono is velzhim aifram noi is flisam inu is andohtjo 3 vezhiere saporedama …« (Tapravi inu tazieli Colemone-Shegen, 1800, p. 183). At present, the lexicon contains over 25,000 entries (including modern words in archaic texts), 50,000 word-forms and 70,000 archaic forms. The third resource is an extensive collection of digitised texts, similar to the corpus except that its words are annotated automatically by ToTrTaLe, a tool developed to process historical Slovenian text. The tool implements a pipeline: it first tokenises the text and then attempts to transcribe the archaic words to their modern-day equivalents; the text is then tagged and lemmatised using models for modern Slovenian. The collection contains about 5 million words of hand-corrected transcriptions from the following digitised texts:
• Slovenian books and editions of the newspaper »Kmetijske in rokodelske novice«, digitised by the National and University Library (NUK) within the EU project IMPACT (5,000 pages);
• the digital library AHLib, comprising Slovenian books translated from German (100 books);
• a selection of Slovenian books.
All three resources (corpus, lexicon, collection) are encoded according to the Text Encoding Initiative Guidelines TEI P5, which enable the definition of XML schemas for encoding texts for scholarly purposes. The home page of the project at http://nl.ijs.si/imp/ provides access to the resources. The collection and the lexicon are available for on-line browsing, and the corpus and the automatically annotated collection can be searched linguistically via a concordancer. All the resources can also be downloaded in their source XML form under the Creative Commons Attribution Licence.
We expect to extend the resources in the future; however, even their present scope is sufficient for corpus-based diachronic studies of historical Slovenian and for developing useful language technology tools for processing cultural heritage texts.
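
The workflow described in the abstract (tokenise, transcribe to modern forms, tag and lemmatise, then derive lexicon entries from the annotated tokens) can be illustrated with a minimal Python sketch. It is not the ToTrTaLe implementation: the lookup tables, the `Token` class and all function names are hypothetical stand-ins for real transcription rules and trained models. Only the annotation of »ajfram« (»ajfrom«, »ajfer«, »Som«, »gorečnost«) is taken from the abstract; treating »aifram« as another spelling of »ajfrom« is an assumption made for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables standing in for real transcription rules and
# trained tagging models. Only ajfram -> ajfrom / ajfer / Som / gorečnost
# comes from the abstract; mapping "aifram" to "ajfrom" is an assumption.
MODERNISE = {"ajfram": "ajfrom", "aifram": "ajfrom"}
ANALYSES = {"ajfrom": ("ajfer", "Som", "gorečnost")}  # modern form -> (lemma, MSD, gloss)


@dataclass(frozen=True)
class Token:
    archaic: str           # word as printed in the historical text
    modern: str            # modernised word-form
    lemma: Optional[str]   # base form; None if the toy tables do not cover it
    msd: Optional[str]     # morphosyntactic description
    gloss: Optional[str]   # closest contemporary synonym


def tokenise(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def annotate(text):
    """Tokenise, modernise, then tag and lemmatise via the lookup tables."""
    tokens = []
    for word in tokenise(text):
        modern = MODERNISE.get(word, word)  # unknown words are left unchanged
        lemma, msd, gloss = ANALYSES.get(modern, (None, None, None))
        tokens.append(Token(word, modern, lemma, msd, gloss))
    return tokens


def build_lexicon(tokens):
    """Group attested archaic forms under modern forms within each entry."""
    entries = defaultdict(lambda: defaultdict(set))
    for t in tokens:
        if t.lemma is not None:  # skip tokens the toy tables cannot analyse
            entries[(t.lemma, t.msd, t.gloss)][t.modern].add(t.archaic)
    return {key: dict(forms) for key, forms in entries.items()}


tokens = annotate("is velzhim aifram noi is flisam") + annotate("ajfram")
lexicon = build_lexicon(tokens)
print(lexicon)
```

Keying each entry by lemma, morphosyntactic description and gloss mirrors the entry structure the abstract describes, with archaic forms nested under their modern equivalents; a real lexicon would also store the attesting citations.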

Journal: Knjižnica: revija za področje bibliotekarstva in informacijske znanosti