Identification of Traffic Information on Twitter Data using Topic Modeling and Entity Recognition

Nuraisa Novia Hidayati, Putri Damayanti, Agus Zainal Arifin

doi:10.26418/jlk.v4i1.40

Abstract

Tweets from the official Twitter accounts of several news portals can provide near-real-time traffic information, which helps maintain smooth mobility. However, this data is mixed with news on other current issues, such as government policies and the pandemic situation. A news grouping process is therefore needed: word vectors are obtained through word embedding and fed into topic modeling to help separate traffic news from other news. We compare two methods that are well tested on Twitter data across various categories: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Previous research indicates that the words composing the topics found by both methods remain quite difficult to interpret. We therefore use Word2vec as the input representation and compare it with the very commonly used term frequency-inverse document frequency (TF-IDF). The expectation is that Word2vec groups related words together and, in turn, yields a better division of topics. The results show that LDA achieves a coherence value of 0.56 with Word2vec vectorization and 0.57 with TF-IDF. With NMF, however, Word2vec gives better results than TF-IDF: TF-IDF achieves a coherence value of only 0.49, while Word2vec reaches 0.52. Furthermore, with NMF, the Word2vec model successfully recognizes words that denote locations. Once the traffic news has been separated, we apply Named Entity Recognition (NER) to detect the location of each incident. We labeled the locations in 30% of the grouped tweet data for use as training data. The method successfully detected locations when tested on some other data.
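
As an illustration of the TF-IDF baseline mentioned above, the following is a minimal sketch of one common TF-IDF variant (relative term frequency multiplied by log(N/df)). The abstract does not state the exact weighting or tooling used in the study, so the formula, the function name `tf_idf`, and the toy tweets are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy tweets (assumed for illustration, not from the study's data).
tweets = [
    "traffic jam on main street after accident",
    "accident slows traffic near city bridge",
    "government announces new pandemic policy",
]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / doc length) * log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(tweets)
```

In this sketch, a term that appears in only one tweet (such as "jam") receives a higher weight than a term shared by two tweets (such as "traffic"). A document-term matrix built from such weights is the kind of input that a topic model such as LDA or NMF would then factorize.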
