THE LINGUISTIC CORPUS VEPKAR IS A LANGUAGE REFUGE FOR THE BALTIC-FINNISH LANGUAGES OF KARELIA

Tatyana Boyko, Nina Zaitseva, Natalia Pellinen, Irina Novak, Elizaveta Trubina, Alexandra Rodionova, Natalya Krizhanovskaya, Andrey Krizhanovsky

doi:10.17076/them1415


Open Access

https://doi.org/10.17076/them1415


Abstract

Conservation areas are created to protect endangered plant and animal species. In a similar way, large, tagged linguistic corpora covering a wide variety of genres are used to preserve and study both safe and endangered languages. The article describes the history, structure and development of the Open Corpus of the Veps and Karelian languages. The Veps language corpus was created in 2009 under the leadership of Nina Zaitseva. Three Karelian subcorpora (Karelian proper, Livvi and Ludian) were added to the corpus in 2016. The united linguistic platform was named “The Open Corpus of the Veps and Karelian languages” (VepKar). The corpus comprises texts and dictionaries stored in a database, together with a computer program (the corpus manager) for searching and processing the data. The corpus manager is written in PHP using the Laravel framework, and the data are stored in a MySQL database. The corpus and dictionary data are available online (dictorpus.krc.karelia.ru). The VepKar authors use YouTube and Wikipedia to popularize the corpus.

Dictionaries and corpus texts are closely interrelated. The multifunctional dictionaries of the Veps and Karelian languages contain definitions, translations, dialect labels, semantic relations (synonyms, antonyms, etc.), examples of word usage with references to the texts, and complete inflectional paradigms. All texts are marked up automatically, and words in the texts are linked to the corresponding meanings in the dictionary entries. The developers continue to add new features to the corpus manager to make the editors' work easier. For example, over the past three years, nominal and verbal inflection rules have been formulated and programmed for all dialects of the Veps language and its newly written variety, as well as for the Livvi-Karelian, North Karelian and Tver newly written varieties of the Karelian language. Thanks to this, 2.1 million word forms were generated in the VepKar system in a semi-automatic mode. The semantic markup in the corpus comprises 2.1 million links between words in the texts and the meanings of lemmas in the dictionary. Grammatical markup was also added: 1.1 million links between words in the texts and the grammatical features of word forms in the dictionary were established automatically.

The multilingual VepKar corpus is divided into subcorpora by language and dialect, and the texts are also classified by style and genre. The corpus has a detailed search system that filters texts by language, style and dialect, by informant, collector or author, and by year of recording or year of publication. Lemmas can be searched by dialect, part of speech, grammatical features, and even lexical-semantic category. These categories became available after the data of the “Comparative and Onomasiological Dictionary of the Dialects of the Karelian, Veps and Sami Languages” were integrated into the vocabulary part of VepKar. In 2021, the Sanahelmi electronic dictionary for Android phones was created on the basis of VepKar. The authors see the development of mobile applications based on corpus data as a promising direction for future work.
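The links between text words, lemma meanings and grammatical features described above can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: the class names (`Lemma`, `Wordform`), the functions (`build_index`, `link_tokens`) and the toy entry are hypothetical and are not taken from the VepKar code base, which is written in PHP with Laravel and stores its data in MySQL.

```python
from dataclasses import dataclass, field


@dataclass
class Wordform:
    form: str     # surface form as it appears in texts
    gramset: str  # grammatical features, e.g. "genitive singular"


@dataclass
class Lemma:
    lemma: str
    pos: str
    meanings: list[str]
    wordforms: list[Wordform] = field(default_factory=list)


def build_index(lemmas):
    """Map each lowercase surface form to the (lemma, wordform) pairs that produce it."""
    index = {}
    for lemma in lemmas:
        for wf in lemma.wordforms:
            index.setdefault(wf.form.lower(), []).append((lemma, wf))
    return index


def link_tokens(text, index):
    """Return one record per token, listing candidate lemma, meaning and gramset links."""
    links = []
    for position, token in enumerate(text.split()):
        # Strip surrounding punctuation before the dictionary lookup.
        key = token.strip(".,;:!?\"()").lower()
        candidates = index.get(key, [])
        links.append({
            "position": position,
            "token": token,
            "links": [
                {"lemma": lemma.lemma, "meanings": lemma.meanings, "gramset": wf.gramset}
                for lemma, wf in candidates
            ],
        })
    return links


# Toy dictionary entry (illustrative only, not copied from VepKar).
kala = Lemma(
    lemma="kala",
    pos="noun",
    meanings=["fish"],
    wordforms=[
        Wordform("kala", "nominative singular"),
        Wordform("kalan", "genitive singular"),
    ],
)

index = build_index([kala])
# Short made-up phrase; the second token is absent from the toy dictionary.
result = link_tokens("Kalan pä.", index)
```

In this sketch a token may receive several candidate links or none at all, which mirrors the general situation in which automatically established links still need checking by editors; how VepKar actually resolves such cases is not specified here.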

Highlights

Keywords: Karelian language; Veps language; corpus linguistics; Open Corpus of the Veps and Karelian languages; corpus manager; inflectional paradigm
Large, tagged linguistic corpora with a great variety of genres are used for the preservation and research of safe and endangered languages
The united linguistic platform was named “The Open Corpus of the Veps and Karelian languages” (VepKar). This linguistic corpus includes texts and dictionaries stored in a database, and a computer program for searching and processing the data

Summary

Architecture and quantitative characteristics of the corpus

Fig. 5. Distribution of VepKar texts by genre.
Fig. 6. Distribution of the number of VepKar texts by year of recording (informants), date of publication, and date of addition to the corpus.

Continually adding such materials to the corpus helps popularize the Karelian and Veps languages and supports a range of outreach, educational and research tasks (not only in literary studies but also in cultural studies, linguistics and other fields). The growing number of literary texts in VepKar, together with the ongoing work of adding new lemmas to the corpus and marking it up, opens broad prospects for studying the language of Karelian-language prose and poetry. Developing the subcorpus of literary texts may also support the creation of programs for a range of applied tasks, for example reconstructing lost words and passages of a literary work, or choosing among textual variants in an author's drafts when preparing texts that were not published during the author's lifetime.

Capabilities and applications of the corpus

Mobile applications as a linguistic future that has already arrived


Journal: Proceedings of the Karelian Research Centre of the Russian Academy of Sciences
Publication Date: Jul 28, 2021
License type: cc-by
