Using Topic Modeling and Word Embedding for Topic Extraction in Twitter

Amna Meddeb, Lotfi Ben Romdhane

doi:10.1016/j.procs.2022.09.134

Abstract

Topic analysis (also called topic detection, topic modeling, or topic extraction) is a machine learning technique that organizes and interprets large collections of text by assigning “tags” or categories according to the topic or theme of each text. It uses natural language processing (NLP) to break human language down into units (speech, words, sentences, context), so that patterns and semantic structures within texts can be found, insights extracted, and data-driven decisions supported. The two most common machine learning approaches to topic analysis are NLP topic modeling and NLP topic classification. Topic modeling faces several challenges: some are common to NLP tasks in general (such as extracting the context of a document), and others are specific to the nature (or properties) of the documents. One type of document that raises particular challenges is the short text found in social networks. With the growth of online social network platforms and applications, large amounts of user-generated text are created daily in the form of comments, reviews, and short messages. Despite the ubiquity of short texts, extracting topics from them remains difficult for several reasons. First, unlike traditional, normal-length texts, short texts typically contain only a few words. Directly applying traditional models to short texts therefore suffers from severe data sparsity (i.e., sparse word co-occurrence patterns within an individual document). Second, the limited context makes it harder to identify the senses of ambiguous words in short texts. Third, the performance of machine learning models generally relies on labeled data. Unfortunately, because of their volume, labeling short texts from social networks remains a tedious and hard task.
In this paper, we propose a model for extracting topics from short texts, and more specifically from tweets. The key features of our proposal are the use of a word embedding technique for topic modeling and of k-Means clustering for the semi-automatic annotation of tweets. In addition, unlike most existing approaches, our model assigns a set of topics to each tweet, each with a confidence degree.
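
To make the idea concrete, the following is a minimal sketch of the general pipeline, not the model proposed in this paper. It averages word vectors to embed each tweet, clusters the tweet vectors with a small k-means loop, and turns distances to the cluster centroids into per-topic confidence degrees. The toy 2-D vectors in `TOY_EMBEDDINGS`, the helper names (`embed`, `kmeans`, `topic_confidences`), and the softmax-over-distances confidence score are all assumptions made for illustration; a real system would load pretrained embeddings and might define confidence differently.

```python
import math
import random

# Hypothetical 2-D word vectors (assumption); real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "goal": (0.9, 0.1), "match": (0.8, 0.2), "team": (0.85, 0.15),
    "vote": (0.1, 0.9), "election": (0.2, 0.8), "senate": (0.15, 0.85),
}


def embed(tweet):
    """Average the vectors of the known words in a tweet; None if no word is known."""
    vecs = [TOY_EMBEDDINGS[w] for w in tweet.lower().split() if w in TOY_EMBEDDINGS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm): returns k centroids for the given points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(component) / len(cluster) for component in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def topic_confidences(point, centroids):
    """Assumed confidence score: softmax over negative distances to the centroids."""
    scores = [math.exp(-math.dist(point, c)) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


tweets = [
    "great goal in the match",
    "team scored a goal",
    "election vote today",
    "senate vote on the bill",
]
points = [embed(t) for t in tweets]
centroids = kmeans(points, k=2)
for tweet, point in zip(tweets, points):
    print(tweet, [round(c, 3) for c in topic_confidences(point, centroids)])
```

In a semi-automatic annotation setting, a human would then inspect a few tweets per cluster and give each cluster a topic name, so that every tweet carries a list of named topics with a confidence for each, rather than a single hard label.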

Full Text