Abstract

One of the most important trends in modern dialectology is the creation of new electronic resources. The article gives an overview of Russian resources of this kind, among which dialect corpora hold a special place. The author focuses on the Tomsk Dialect Corpus, which currently includes more than 1,700,000 tokens. This resource is unparalleled in Russian scientific practice. It is designed as a universal information retrieval system comprising three modules: 1) textual, 2) grammatical, 3) lexicographic. The aim of the lexicographic module is to provide definitions of dialect lexemes. For this purpose, it is proposed to use the Dictionary of Russian Old-Timers’ Dialects of the Middle Part of the River Ob Basin (1964–1967), edited by V.V. Palagina, and two supplements to it (1975, 1983–1986). The article describes the phases of integrating the lexicographic module into the Tomsk Dialect Corpus. The first phase was the automatic recognition of the above-mentioned paper dictionary. The second phase is the editing of the dictionary. The principles of editing the source material follow from the fact that the lexicographic module is treated as part of a universal electronic system. The two basic editing principles are that each word can be processed automatically and that each dictionary entry functions autonomously. The vocabulary and the structure of the dictionary entry were formed in accordance with these principles. When the vocabulary was formed, some dictionary entries (for example, two-word ones) were discarded. The dictionary entry contains the following main fields: headword, definition and contexts. One of the main editing tasks is to combine dictionary entries from different volumes of the dictionary into one. The combined items are marked either as homonyms or as meanings of one word. Examples of dictionary entries before and after editing are presented in the article.
By now, about half of the original vocabulary has been processed (letters from A to M, 12,450 entries). The final version of the electronic dictionary, as part of the Tomsk Dialect Corpus, is planned to be presented on the website of the Laboratory of General and Siberian Lexicography (http://losl.tsu.ru/) by June 2021. The prospects of the project include, first, the expansion of the vocabulary and, second, the implementation of corpus search by dictionary labels (diminutive, augmentative, etc.). The presented solutions can be used in the development of other dialect corpora.
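
The entry structure and the merging of entries from different volumes can be illustrated with a minimal sketch. It is not the project's actual implementation: the class names, field names, the `homonym` flag (standing in for the editor's decision) and the rule of skipping multi-word headwords are all assumptions made for illustration, and the sample records are placeholders rather than real dictionary data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    """One meaning of a headword with its illustrative contexts."""
    definition: str
    contexts: List[str] = field(default_factory=list)
    source: str = ""  # hypothetical label: main dictionary or a supplement


@dataclass
class Entry:
    """A self-contained entry: headword, definition(s) and contexts."""
    headword: str
    senses: List[Sense] = field(default_factory=list)
    homonym_index: int = 0  # 0 means the headword has no homonyms


def merge_entries(records: List[dict]) -> Dict[str, List[Entry]]:
    """Combine records that share a headword across volumes.

    Each record is a dict with the assumed keys 'headword', 'definition',
    'contexts', 'source' and an editor-supplied boolean 'homonym'.
    """
    merged: Dict[str, List[Entry]] = {}
    for rec in records:
        # Assumption: multi-word headwords are left out of the vocabulary.
        if len(rec["headword"].split()) != 1:
            continue
        sense = Sense(rec["definition"], list(rec["contexts"]), rec["source"])
        entries = merged.setdefault(rec["headword"], [])
        if not entries or rec.get("homonym", False):
            # First occurrence, or marked by the editor as a homonym.
            entries.append(Entry(rec["headword"], [sense]))
        else:
            # Otherwise treat it as a further meaning of the same word.
            entries[-1].senses.append(sense)
    # Number homonyms only where a headword has more than one entry.
    for entries in merged.values():
        if len(entries) > 1:
            for i, entry in enumerate(entries, start=1):
                entry.homonym_index = i
    return merged


# Placeholder records, not taken from the dictionary.
sample = [
    {"headword": "word_a", "definition": "meaning 1",
     "contexts": ["context 1"], "source": "main", "homonym": False},
    {"headword": "word_a", "definition": "meaning 2",
     "contexts": [], "source": "supplement", "homonym": False},
]
dictionary = merge_entries(sample)
```

Keeping every sense together with its own contexts and source label is one way to let each entry function autonomously, since it can then be displayed or searched without consulting the original volumes.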
