Abstract

Probabilistic topic modeling is a text mining technique that extracts sets of term probability distributions which can intuitively be interpreted as latent topics. Most techniques use only document-term frequency matrices as input data for this extraction. Moreover, topic models estimate posterior document-topic distributions that are useful for intelligent query processing in document retrieval. This paper discusses two approaches to topic modeling involving Dirichlet distributions and Dirichlet processes. However, these and related approaches presume suitable text preprocessing in order to keep the parameter spaces estimated from training text corpora at manageable sizes. In the present paper, we discuss the influence of morphological preprocessing of training texts. Morphological analysis is a discipline of computational linguistics that decomposes observed terms into their base lemmata. This is achieved by a deep analysis of the observed terms, as opposed to the straightforward prefix or suffix stripping used in conventional stemming algorithms. Morphological preprocessing is especially effective in inflection-rich languages such as Finnish or German and effectively reduces the size of the training vocabulary. In addition, morphological preprocessing allows compound words to be decomposed. It is therefore of considerable interest to study the influence of morphological preprocessing on text mining and statistical topic models. In the experiments reported in the application section of this paper, significant changes in the frequency structure of the document-term matrices were found. Interestingly, morphological preprocessing also led to substantial improvements in the model quality indicators of the topic models. Steps for further research are suggested in the concluding section.
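To make the described workflow concrete, the short Python sketch below contrasts a plain LDA topic model trained on raw surface forms with one trained on morphologically preprocessed tokens. It is only an illustration under assumptions that are not part of the paper: the tiny German corpus and the hand-written LEMMA lookup are hypothetical stand-ins for a real training corpus and a real morphological analyzer with compound splitting, gensim is used as the modeling library, and log perplexity and UMass coherence serve merely as examples of the kind of model quality indicators reported in the experiments.

```python
# Minimal sketch (assumption: gensim is installed). The corpus and the LEMMA
# lookup below are hypothetical; a real pipeline would call a morphological
# analyzer with lemmatization and compound splitting instead.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

raw_docs = [
    "Steuererklärungen werden beim Finanzamt abgegeben".split(),
    "Finanzämter prüfen abgegebene Steuererklärungen".split(),
    "Dokumente enthalten Wörter aus einem Vokabular".split(),
    "Jedes Dokument enthält Wörter des Vokabulars".split(),
]

# Hand-written lemma/compound-split table (hypothetical stand-in).
LEMMA = {
    "Steuererklärungen": ["Steuer", "Erklärung"],
    "Finanzamt": ["Finanz", "Amt"],
    "Finanzämter": ["Finanz", "Amt"],
    "abgegeben": ["abgeben"],
    "abgegebene": ["abgeben"],
    "Dokumente": ["Dokument"],
    "Dokument": ["Dokument"],
    "enthalten": ["enthalten"],
    "enthält": ["enthalten"],
    "Wörter": ["Wort"],
    "Vokabular": ["Vokabular"],
    "Vokabulars": ["Vokabular"],
}

def lemmatize(tokens):
    """Replace each surface form by its base lemmata (with compound splitting)."""
    out = []
    for tok in tokens:
        out.extend(LEMMA.get(tok, [tok.lower()]))
    return out

def fit_lda(token_docs, num_topics=2):
    """Build a bag-of-words document-term corpus and fit a plain LDA model."""
    dictionary = Dictionary(token_docs)
    bow = [dictionary.doc2bow(doc) for doc in token_docs]
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, corpus=bow, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    return dictionary, lda.log_perplexity(bow), coherence

for label, docs in [("raw", raw_docs),
                    ("lemmatized", [lemmatize(d) for d in raw_docs])]:
    dictionary, logppl, coh = fit_lda(docs)
    print(f"{label:>10}: vocabulary={len(dictionary)}, "
          f"log perplexity={logppl:.2f}, u_mass coherence={coh:.3f}")
```

On realistic corpora the lemmatized run typically shows a markedly smaller vocabulary; whether and how much the quality indicators improve depends on the corpus, the language, and the morphological analyzer used.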
