Abstract

The paper reviews the current state of Slavonic historical text corpora and the requirements they must meet for processing, searching and presenting linguistic data. Two factors are identified as the main reasons for slow progress in this field: the high labor cost of manually creating and tagging machine-readable transcriptions, and the need to develop specialized corpus managers that provide access to the data and visualize it. One currently important use of corpus data is its analysis with quantitative and statistical methods. The paper describes the functionality of the historical corpus “Manuscript”, which comprises medieval Slavonic manuscripts of the 10th to 15th centuries (manuscripts.ru). The capabilities of its n-gram module for detecting grammatically and semantically fixed expressions that characterize the subject matter of the texts are demonstrated on a subcorpus of three Old Russian chronicles (Laurentian, Hypatian, Radzivilovsky). The statistical measures Mutual Information (MI) and T-score yield lists of relatively rare and of more frequent fixed expressions, respectively. The MI lists include proper names, paired names, and fixed biblical and Slavonic-bookish subordinating constructions. The T-score lists provide information on events, goals, persons, outcomes and their characteristics. The paper concludes that these statistical measures are effective for automatically finding semantically and thematically important expressions in historical sources.
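As an illustration of the two measures named above, here is a minimal Python sketch that scores adjacent word pairs (bigrams) using the formulations common in corpus linguistics: MI = log2(f(x,y) · N / (f(x) · f(y))) and T-score = (f(x,y) − f(x) · f(y) / N) / √f(x,y). This is an assumption for illustration only. The n-gram module of the “Manuscript” corpus may compute these measures differently, and the function name `bigram_scores` and the toy token list are hypothetical.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, MI, t_score)} for adjacent token pairs.

    Assumed formulas (standard corpus-linguistics variants, not taken
    from the paper):
      expected = f(w1) * f(w2) / N
      MI       = log2(f(w1, w2) / expected)
      t_score  = (f(w1, w2) - expected) / sqrt(f(w1, w2))
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / n
        mi = math.log2(f12 / expected)
        t_score = (f12 - expected) / math.sqrt(f12)
        scores[(w1, w2)] = (f12, mi, t_score)
    return scores


# Toy input: placeholder tokens, not real chronicle data.
toy_tokens = ["a", "b", "a", "b", "c"]
toy_scores = bigram_scores(toy_tokens)

# Rank the same bigrams by each measure.
by_mi = sorted(toy_scores, key=lambda bg: toy_scores[bg][1], reverse=True)
by_t = sorted(toy_scores, key=lambda bg: toy_scores[bg][2], reverse=True)
```

Sorting by MI tends to promote rare but strongly associated pairs, while sorting by T-score tends to promote frequent pairs, which matches the division between the MI lists and the T-score lists described above. In practice a minimum frequency threshold is usually applied before ranking by MI, since pairs that occur only once can receive inflated scores.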
