Abstract

In recent years, latent Dirichlet allocation (LDA), a state-of-the-art topic modeling method, has been applied to several text mining tasks. An LDA solution can be used either as a clustering solution or as a low-dimensional document representation. The low-dimensional space obtained by LDA is usually called a semantic space, since alternative forms expressing the same concept or topic are projected onto a common representation. In this work, we discuss the problem of document organization at different levels of semantic complexity and evaluate the use of LDA for document organization in a real-world application scenario. Our hypothesis is that LDA achieves good results when documents can be organized based on vocabulary alone, but that the LDA semantic space is not semantically rich enough to support organization in more complex scenarios. We developed a proof of concept to provide evidence for this hypothesis and evaluated the use of LDA in two different approaches: as an exclusive partitional clustering solution and as a dimension reduction method. The solutions were evaluated using both the FScore measure and users' expectations, considering document organization problems with different levels of semantic requirements. The results indicate that LDA achieved good FScore values when the organization depends mainly on the document vocabulary, but the method was not able to support the discovery of semantically more complex patterns.
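The two usage modes evaluated in the paper can be illustrated with a short sketch. The snippet below is not the authors' implementation; it only assumes scikit-learn and a stand-in corpus (20 Newsgroups, used here purely for illustration) to show how a single LDA fit yields both an exclusive partitional clustering and a low-dimensional semantic representation of the documents.

# Minimal sketch: LDA used (a) as a partitional clustering solution and
# (b) as a dimension reduction step. Assumes scikit-learn; the corpus,
# number of topics, and vectorizer settings are hypothetical choices.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Stand-in corpus; the paper uses documents from a real-world application scenario.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# Bag-of-words representation of the document vocabulary.
counts = CountVectorizer(max_df=0.95, min_df=5, stop_words="english").fit_transform(docs)

# Fit LDA; doc_topic[i, k] is the proportion of topic k in document i.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topic = lda.fit_transform(counts)

# (a) Exclusive partitional clustering: assign each document to its dominant topic.
cluster_by_topic = doc_topic.argmax(axis=1)

# (b) Dimension reduction: cluster in the low-dimensional "semantic space"
# spanned by the topic proportions instead of the raw term space.
cluster_in_semantic_space = KMeans(n_clusters=20, random_state=0).fit_predict(doc_topic)

In (a) the topic model itself defines the partition, while in (b) the topic proportions serve only as features for a separate clustering algorithm; the paper compares both against the semantic requirements of the target organization.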
