Abstract

Just as conservation areas are created to protect endangered plant and animal species, large tagged linguistic corpora covering a wide variety of genres are used to preserve and study languages, both safe and endangered. The article describes the history, structure and development of the Open Corpus of the Veps and Karelian languages. The Veps language corpus was created in 2009 under the leadership of Nina Zaitseva. Three Karelian subcorpora (Karelian proper, Livvi and Ludian) were added to the corpus in 2016. The unified linguistic platform was named “The Open Corpus of the Veps and Karelian languages” (VepKar). The corpus comprises texts and dictionaries stored in a database, together with a computer program (the corpus manager) for searching and processing the data. The corpus manager is written in PHP using the Laravel framework, and the data are stored in a MySQL database. The corpus and dictionary data are available online (dictorpus.krc.karelia.ru). The VepKar authors use YouTube and Wikipedia to popularize the corpus. The dictionaries and the corpus texts are closely interlinked. The multifunctional dictionaries of the Veps and Karelian languages contain definitions, translations, dialect labels, semantic relations (synonyms, antonyms, etc.), examples of word usage with references to the source texts, and complete inflectional paradigms. All texts are marked up automatically, and words in the texts are linked to the corresponding meanings in the dictionary entries. The developers continue to add new features to the corpus manager to make the editors' work easier. For example, over the past three years, nominal and verbal inflection rules have been formulated and programmed for all dialects of the Veps language and its newly written variety, as well as for the Livvi-Karelian, North Karelian and Tver newly written varieties of the Karelian language.
Thanks to these rules, 2.1 million word forms were generated in the VepKar system in semi-automatic mode. The semantic markup of the corpus consists of 2.1 million links between words in the texts and the meanings of lemmas in the dictionary. Grammatical markup was also added: 1.1 million links between words in the texts and the grammatical features of word forms in the dictionary were established automatically. The multilingual VepKar corpus is divided into subcorpora by language and dialect, and the texts are also classified by style and genre. The corpus has an advanced search system that filters texts by language, style and dialect, by informant, collector or author, and by year of recording or year of publication. Lemmas can be searched by dialect, part of speech, grammatical features, and even lexical-semantic category. These categories became available when the data of the “Comparative and Onomasiological Dictionary of the Dialects of the Karelian, Veps and Sami Languages” were integrated into the dictionary part of VepKar. In 2021, the Sanahelmi electronic dictionary for Android phones was created on the basis of VepKar. The authors see the development of mobile applications based on corpus data as a promising direction for future work.
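
The linking and filtering described above can be illustrated with a minimal sketch. It is not the VepKar implementation (the corpus manager is written in PHP with Laravel and MySQL, and its actual schema is not given here). All class, field and function names below are hypothetical, and the tokens are placeholders rather than real Veps or Karelian words. The sketch assumes each text carries metadata and a map from token positions to dictionary meaning ids, which is one simple way to represent the text-to-dictionary links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Text:
    """One corpus text with metadata and dictionary links (hypothetical layout)."""
    text_id: int
    language: str
    dialect: str
    genre: str
    year: int
    tokens: List[str]
    # token position -> id of a dictionary meaning (the "semantic markup")
    meaning_links: Dict[int, int] = field(default_factory=dict)


def filter_texts(
    texts: List[Text],
    language: Optional[str] = None,
    dialect: Optional[str] = None,
    genre: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[Text]:
    """Keep only the texts that match every filter that was supplied."""
    result = []
    for t in texts:
        if language is not None and t.language != language:
            continue
        if dialect is not None and t.dialect != dialect:
            continue
        if genre is not None and t.genre != genre:
            continue
        if year_from is not None and t.year < year_from:
            continue
        if year_to is not None and t.year > year_to:
            continue
        result.append(t)
    return result


def find_usages(texts: List[Text], meaning_id: int) -> List[Tuple[int, int, str]]:
    """Return (text_id, token_position, token) for every word linked to a meaning."""
    usages = []
    for t in texts:
        for position, linked_id in sorted(t.meaning_links.items()):
            if linked_id == meaning_id:
                usages.append((t.text_id, position, t.tokens[position]))
    return usages


# Placeholder data: tokens and ids are invented for illustration only.
SAMPLE_TEXTS = [
    Text(1, "veps", "new_written", "fiction", 2015,
         ["tok_a", "tok_b", "tok_c"], {0: 10, 2: 11}),
    Text(2, "karelian", "livvi", "folklore", 1960,
         ["tok_d", "tok_a"], {1: 10}),
    Text(3, "veps", "northern", "folklore", 1937, ["tok_e"]),
]

if __name__ == "__main__":
    recent_veps = filter_texts(SAMPLE_TEXTS, language="veps", year_from=2000)
    print([t.text_id for t in recent_veps])
    print(find_usages(SAMPLE_TEXTS, meaning_id=10))
```

In a relational database the same links would more likely live in a separate join table between text tokens and meanings. The in-memory map is used here only to keep the example self-contained.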

Highlights

  • Keywords: Karelian language; Veps language; corpus linguistics; Open Corpus of the Veps and Karelian languages; corpus manager; inflectional paradigm

  • Large, tagged linguistic corpora with a great variety of genres are used for the preservation and research of safe and endangered languages

  • The united linguistic platform was named “The Open Corpus of the Veps and Karelian languages” (VepKar). This linguistic corpus includes texts and dictionaries stored in a database, and a computer program for searching and processing the data

Summary

Architecture and quantitative characteristics of the corpus

Fig. 5. Distribution of VepKar texts by genre.
Fig. 6. Distribution of the number of VepKar texts by year of recording (from informants), date of publication, and date of addition to the corpus.

The continual addition of such materials to the corpus helps to popularize the Karelian and Veps languages and supports a range of outreach, educational and research tasks (not only in literary studies, but also in cultural studies, linguistics and other fields). The growing number of literary texts in VepKar, together with the ongoing work of adding new lemmas to the corpus and annotating it, opens broad prospects for studying the language of Karelian-language prose and poetry. Developing the subcorpus of literary texts may also support the creation of software for a range of applied tasks, for example reconstructing lost words and passages of a literary work, or choosing among variants in an author's drafts when preparing texts that the author did not publish during their lifetime.

Capabilities and applications of the corpus
Mobile applications as the linguistic future that has already arrived