Question Tags or Text for Topic Modeling: Which is better

Sneh Prabha, Neetu Sardana

doi:10.1016/j.procs.2023.01.193

Open Access

Abstract

Topic modeling is a probabilistic statistical technique used to find the latent topics that best describe the content of a set of documents. Community Question Answering websites such as Quora, Stack Overflow and Yahoo! Answers are widely used, and topic modeling is applied to them because a large number of queries pour in daily, which makes it challenging to understand, summarize and synthesize the main topics of discussion. On these websites, two sources of information are available for analyzing the key latent topics: question text and tags. Questions are in textual format, while tags are keywords or tokens related to the question being asked that describe its content. In past studies, most researchers have used question text for topic modeling. It is still unclear why tags have not been considered for this purpose. To address this gap, this paper performs topic modeling using both question tags and question text. Tag-based topic modeling is compared with text-based topic modeling on two metrics, namely coherence and perplexity. Experiments were conducted on three real datasets from the Stack Exchange website, namely Artificial Intelligence, Software Engineering and Quantum Computing. At a high level, tag-based topic modeling looked promising, but closer observation revealed the opposite. It was found that topic modeling using question text is preferable, as topic modeling using tags collapses after a certain number of topics.
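To make the two evaluation metrics concrete, the following is a minimal sketch of how coherence and perplexity can be computed. The abstract does not state which coherence measure was used, so the UMass-style document co-occurrence score below is an assumption. The function names `umass_coherence` and `perplexity` and the toy tag documents are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic (assumed measure, not the paper's).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over all pairs where w_j is ranked
    above w_i, with D counting the documents that contain the given words.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


def perplexity(token_log_probs):
    """Perplexity as exp of the negative mean per-token log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical tag "documents": one list of tags per question.
tag_docs = [
    ["neural-networks", "deep-learning"],
    ["deep-learning", "training"],
    ["neural-networks", "training"],
]

topic = ["neural-networks", "deep-learning", "training"]
print("coherence:", umass_coherence(topic, tag_docs))

# A model assigning probability 0.25 to every held-out token.
print("perplexity:", perplexity([math.log(0.25)] * 8))
```

Higher coherence and lower perplexity are generally read as better. In a comparison like the one described here, both scores would be computed for a range of topic counts, once with tags as the documents and once with question text.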

Full Text