Conservation areas are created to protect endangered plant and animal species. In a similar way, large tagged linguistic corpora covering a wide variety of genres serve to preserve and study both safe and endangered languages. The article describes the history, structure and development of the Open Corpus of the Veps and Karelian languages. The Veps language corpus was created in 2009 under the leadership of Nina Zaitseva. Three Karelian subcorpora (Karelian proper, Livvi and Ludian) were added to the corpus in 2016. The united linguistic platform was named “The Open Corpus of the Veps and Karelian languages” (VepKar). The corpus includes texts and dictionaries stored in a database, and a computer program (the corpus manager) for searching and processing the data. The corpus manager is written in PHP using the Laravel framework. The data are stored in a MySQL database. Corpus and dictionary data are available online (dictorpus.krc.karelia.ru). The VepKar authors use YouTube and Wikipedia to popularize the corpus. Dictionaries and corpus texts are closely interrelated. The multifunctional dictionaries of the Veps and Karelian languages contain definitions, translations, dialect labels, semantic relations (synonyms, antonyms, etc.), examples of word usage with references to texts, and complete inflectional paradigms. All texts are marked up automatically, and words in the texts link to the corresponding meanings in the dictionary entries. The developers continue to add useful features to the corpus manager to make the editors' work easier. For example, over the past three years, nominal and verbal inflection rules have been formulated and programmed for all dialects of the Veps language and its newly written version, as well as for the Livvi-Karelian, North Karelian and Tver newly written versions of the Karelian language.
Thanks to this, 2.1 million word forms were generated in the VepKar system in a semi-automatic mode. The semantic markup of the corpus comprises 2.1 million links between words in the texts and the meanings of lemmas in the dictionary. Grammatical markup was also added: 1.1 million links between words in the texts and the grammatical features of word forms in the dictionary were established automatically. The multilingual VepKar corpus is divided into subcorpora by language and dialect, and the texts are also classified by style and genre. The corpus has a sophisticated search system that filters texts by language, style and dialect, by informant, collector or author, and by year of recording or year of publication. Lemmas can be searched by dialect, part of speech, grammatical features, and even lexical-semantic category. These categories became available through the integration of data from the outstanding “Comparative and Onomasiological Dictionary of the Dialects of the Karelian, Veps and Sami Languages” into the vocabulary part of VepKar. In 2021, the Sanahelmi electronic dictionary for Android phones was created on the basis of VepKar. We see the development of mobile applications based on corpus data as a promising direction for future work.
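
As a rough illustration of the markup described above, the following Python sketch links each word of a text to candidate lemma meanings (semantic markup) and to the grammatical features of matching word forms (grammatical markup). It is not the VepKar implementation, which is written in PHP with Laravel and MySQL. All class names, function names and field names here are assumptions, and the single dictionary entry is a toy example rather than data taken from the corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Wordform:
    form: str       # inflected surface form
    gramset: str    # grammatical features of this form


@dataclass
class Lemma:
    lemma: str
    lang: str                  # language or dialect code (hypothetical)
    pos: str                   # part of speech
    meanings: list[str]        # meaning glosses
    wordforms: list[Wordform] = field(default_factory=list)


def build_index(lemmas):
    """Map (lang, lowercased surface form) to a list of (lemma, wordform)."""
    index = {}
    for lemma in lemmas:
        for wf in lemma.wordforms:
            key = (lemma.lang, wf.form.lower())
            index.setdefault(key, []).append((lemma, wf))
    return index


def mark_up(tokens, lang, index):
    """Return one record per token with candidate meaning and grammar links."""
    marked = []
    for token in tokens:
        hits = index.get((lang, token.lower()), [])
        marked.append({
            "token": token,
            # semantic markup: token -> meanings of matching lemmas
            "meaning_links": [(l.lemma, m) for l, _ in hits for m in l.meanings],
            # grammatical markup: token -> features of matching word forms
            "gram_links": [(l.lemma, wf.gramset) for l, wf in hits],
        })
    return marked


# Toy dictionary entry (illustrative only).
kodi = Lemma(
    "kodi", "vep", "noun", ["house, home"],
    [Wordform("kodi", "nominative singular"),
     Wordform("kodiš", "inessive singular")],
)
index = build_index([kodi])
result = mark_up(["Kodiš", "xyz"], "vep", index)
```

A token with no matching word form, such as `"xyz"` here, simply receives empty link lists. In a real system such tokens would presumably be left for editors to resolve by hand.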