Extracting Turkish tweet topics using LDA

Fahriye Gemci, Kadir A Peker

doi:10.1109/eleco.2013.6713899

Abstract

Social media is a popular medium for sharing activities, opinions and feelings with others, and Twitter has become one of the most popular social media services. Finding relevant information in the large space of Twitter is a challenge. Various algorithms have been developed to find related tweets, and extracting tweet topics is one technique that can be used for this purpose. Recently, LDA (Latent Dirichlet Allocation) has been used successfully to analyze tweet topics in English. However, LDA has not been tried on Turkish tweets. Turkish is an agglutinative language, which makes applying LDA a new challenge compared to LDA on English tweets. In our application, a series of preprocessing steps such as stemming, stop word elimination, and cleaning of punctuation and spelling errors is used to make the tweet texts suitable for LDA analysis. 6 6.000 tweets and 4.000 control tweets are crawled with the Twitter4j library [2]. The Zemberek library is used for stemming [3]. The language used in tweets, with its own rules, widespread spelling errors or made-up words, makes text analysis very difficult. Some of these problems are solved by adding new words to the Zemberek library, and some of the problematic words are removed completely. After preprocessing, LDA is performed and 40 topics are extracted. The initial results look promising: significant topics such as football teams, celebrity names or hot news topics are detected. Using these results, automatic recommendation of relevant tweets will be performed.
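
The pipeline described in the abstract (cleaning, stop word elimination, stemming, then LDA) can be illustrated with a minimal sketch. This is not the authors' implementation, which relies on Twitter4j and Zemberek. It is a standard-library Python approximation under several assumptions: the stop word list and suffix list are tiny hypothetical stand-ins, `toy_stem` only mimics what a real morphological stemmer does, and `lda_gibbs` is a plain collapsed Gibbs sampler with illustrative hyperparameters.

```python
import random
import re
from collections import Counter

# Hypothetical, tiny stop word list; a real system would use a full Turkish list.
STOP_WORDS = {"ve", "bir", "bu", "da", "de", "için", "ile", "çok"}
# Toy suffix list standing in for a real morphological stemmer such as Zemberek.
SUFFIXES = ("lar", "ler", "da", "de", "ı", "i")


def toy_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, drop URLs, mentions and punctuation, remove stop words, stem."""
    # Note: str.lower() is not Turkish-aware (dotted/dotless I); assumed acceptable here.
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters only
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total tokens in topic k
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    assignments = []
    for d, doc in enumerate(docs):                       # random initial topics
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]                    # remove current assignment
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z                    # record resampled topic
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word


if __name__ == "__main__":
    # Made-up example tweets, for illustration only.
    tweets = [
        "Galatasaray ve Fenerbahçe maçı çok güzel! http://t.co/x @ali",
        "Bu akşam maç var, takımlar hazır",
        "Yeni dizi ve film haberleri",
    ]
    docs = [preprocess(t) for t in tweets]
    for k, counts in enumerate(lda_gibbs(docs, num_topics=2)):
        print(k, [w for w, _ in counts.most_common(3)])
```

On a corpus of the size described above, the same structure would be run with `num_topics=40`; in practice an established LDA library and a proper Turkish stemmer would replace both toy components.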

Full Text