Abstract

A national corpus is not limited to the meta descriptions and metadata of the texts it contains; it also requires linguistic markup. Linguistic markup is linguistic information assigned to each lexical unit in a text according to its orthographic, phonetic, lexical, and grammatical features. Developing linguistic markup for a historical subcorpus, however, is a complex task that requires in-depth research, because most of the texts in the historical subcorpus are written in Arabic script, and medieval texts written in Arabic script have been transcribed in different ways. Compared with other subcorpora, the development of a historical subcorpus therefore poses both theoretical and technical difficulties. In this regard, the purpose of the article is to consider linguistic markup for the texts of the historical subcorpus, which is being developed for the first time at the Institute of Linguistics named after Akhmet Baitursynuly. Tasks: to identify linguistic, lexical and grammatical markup for transcribed texts; to take into account the experience of other countries in developing lexical and grammatical markup; to analyze texts transcribed from Arabic script into Cyrillic script; to identify the variability of transcribed words; to describe how the lexical and grammatical markup program operates. The study uses descriptive, historical-comparative, linguotextological, and linguostatistical methods.
As a result of the study, the experience of developing the historical subcorpus of the Russian language was considered in designing the markup; transcriptions of texts written in Arabic script in different periods of the Middle Ages were analyzed; lexical and grammatical markup for transcribed texts was determined; and the mechanisms of a lexical and grammatical search system for transcribed texts were described. Practical significance: the development of lexical and grammatical markup for transcribed texts included in the historical subcorpus will provide a useful linguistic tool for studying the evolution of a given lexical unit.

