Abstract

This study evaluates how effective the thesaurus method is for building a list of topic classes when machine learning is used to classify the texts of sociolinguistic interviews by topic. More broadly, the paper considers the potential of machine learning for the topic annotation of linguistic corpus materials. The analyzed material is polytopical because it belongs to the genre of dialogic speech. The topics, identified through a preliminary introspective analysis of the texts, form a hierarchy that can be described with a thesaurus. The paper discusses the results of applying an unsupervised machine learning method with two sets of topic class names: the list of topics used in manual text annotation, and an extended list of micro-topics whose names were selected from a Russian-language thesaurus. The paper's novelty is that it is the first to propose the thesaurus method for selecting topic labels for the zero-shot classification of weakly structured Russian texts. The findings show that a more detailed lexical description of the topic classes improves the classification result.
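The contrast between the two label sets can be illustrated with a minimal sketch. It is not the authors' implementation: the thesaurus entries, the example utterance, and the `toy_score` function are all hypothetical, and `toy_score` is a plain word-match stand-in for whatever zero-shot model would actually score a text against a label. The sketch only shows how micro-topic labels drawn from a thesaurus can be mapped back to their parent topics, and why a polytopical utterance may receive several topics at once.

```python
# Hypothetical two-level thesaurus: topic -> micro-topics.
# These entries are illustrative and are not taken from the paper.
THESAURUS = {
    "family": ["parents", "children", "marriage"],
    "work": ["salary", "colleagues", "career"],
    "leisure": ["travel", "sports", "cinema"],
}


def toy_score(text, label):
    """Stand-in for a zero-shot model: 1.0 if the label occurs as a word, else 0.0."""
    tokens = text.lower().split()
    return 1.0 if label.lower() in tokens else 0.0


def classify(text, labels, score=toy_score, threshold=0.5):
    """Multi-label classification against a flat list of topic names."""
    return {label for label in labels if score(text, label) >= threshold}


def classify_with_thesaurus(text, thesaurus, score=toy_score, threshold=0.5):
    """Assign a topic if the topic name or any of its micro-topics passes the threshold."""
    topics = set()
    for topic, micro_topics in thesaurus.items():
        candidates = [topic] + micro_topics
        if any(score(text, candidate) >= threshold for candidate in candidates):
            topics.add(topic)
    return topics


utterance = "my parents wanted me to build a career but I preferred travel"

# Flat list of topic names only, as in manual annotation.
coarse = classify(utterance, THESAURUS.keys())

# Extended list: each topic is also described by its micro-topics.
fine = classify_with_thesaurus(utterance, THESAURUS)

print("topic names only:", sorted(coarse))
print("with micro-topics:", sorted(fine))
```

In this toy setting the flat topic names match nothing in the utterance, while the extended label set recovers all three topics. With a real zero-shot model the scores would be graded rather than binary, so the aggregation step (here `any`) and the threshold would need to be chosen empirically.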

