Abstract

This article addresses topic modeling as an unsupervised machine learning technique. It examines how topic modeling methods can determine the topics of documents so that the documents can then be categorized. Latent semantic analysis, probabilistic latent semantic analysis and latent Dirichlet allocation are considered. An approach is proposed for constructing effective topic models of text document collections in Ukrainian and other synthetic languages, based on the peculiarities of this language type, and its main stages are described. The approach consists of a custom input data preprocessing pipeline, which covers file loading, text extraction, removal of improper symbols, tokenization, stop-word removal and stemming of each token, followed by a newly introduced model pruning stage, which makes any modern topic modeling method applicable to synthetic languages. The approach was implemented in Python and used to obtain a topic model of a collection of Ukrainian-language scientific publications on civic identity and related topics. An expert in political psychology who studies the phenomenon of civic identity was involved in the research to evaluate the quality of the topic model. Following the expert evaluation of the topics identified during modeling, it was proposed to clarify the formulation of cluster names based on the semantics of the sets of words that form them. Overall, according to the expert, the identified topics represent the concept of the civic identity of an individual and, when used to categorize documents, will simplify researchers' work with literature sources on this issue. This demonstrates the effectiveness of the proposed approach.
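The abstract does not give implementation details, so the following is only a minimal Python sketch of how the text-level stages (removal of improper symbols, tokenization, stop-word removal, stemming) and a pruning stage might fit together. The stop-word list, the suffix list and every function name are hypothetical placeholders, not the authors' code. Pruning is interpreted here as filtering stems by document frequency, which is an assumption: the abstract does not define the stage.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the article).
STOP_WORDS = {"і", "та", "в", "на", "що", "це", "до", "з"}

# Hypothetical suffix list for a toy stemmer; a real Ukrainian stemmer is far richer.
SUFFIXES = ("ість", "ості", "ого", "ому", "ий", "ій", "ою", "ах", "ів",
            "а", "и", "і", "у", "я")


def clean(text):
    """Lowercase the text and blank out digits, underscores and punctuation."""
    return re.sub(r"[^\w\s'’ʼ]|[\d_]", " ", text.lower())


def tokenize(text):
    """Split into letter-only tokens, keeping word-internal apostrophes."""
    return re.findall(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*", text)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the longest matching suffix, leaving at least three characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the text-level stages in the order listed in the abstract."""
    return [stem(t) for t in remove_stop_words(tokenize(clean(text)))]


def prune(docs, min_df=2, max_df_ratio=0.9):
    """Assumed pruning rule: keep stems that occur in at least `min_df`
    documents and in at most `max_df_ratio` of all documents."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    keep = {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}
    return [[t for t in doc if t in keep] for doc in docs]
```

In this sketch, stemming is what lets inflected forms such as "ідентичність" and "ідентичності" collapse to one stem, which matters for a synthetic language where a single lexeme has many surface forms. The pruned token lists could then be passed to any topic modeling implementation (LSA, pLSA or LDA); the thresholds `min_df` and `max_df_ratio` are illustrative values only.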
