Abstract

Social networks have become a major channel for information propagation. On the popular Twitter platform, mobile users post or relay messages from different locations. The content, meaning and location of tweets show how an event, such as the bursty “JeSuisCharlie” event that took place in France in January 2015, is perceived in different countries. This research aims to cluster tweets according to the co-occurrence of their terms, including the country, and to predict the probable country of a non-located tweet from its content. First, we present the process of collecting a large quantity of data from the Twitter website. The resulting data set contains 2,189 located tweets about “Charlie”, posted from the 7th to the 14th of January. We describe an original method adapted from the Author-Topic (AT) model, which is based on Latent Dirichlet Allocation (LDA). We define a homogeneous space containing both lexical content (words) and spatial information (country). During a training process on part of the sample, we produce a set of clusters (topics) based on statistical relations between lexical and spatial terms. During a clustering task on the rest of the sample, we evaluate the effectiveness of the method, which reaches up to 95% correct assignment. This shows that our model is suitable for predicting tweet location after a learning process.

Highlights

  • The exponential growth of available data on the Web enables users to access a large quantity of information

  • We describe an original method adapted from the Author-Topic (AT) model based on the Latent Dirichlet Allocation (LDA) method

  • We propose to build a topic model, called author-topic (AT) (Rosen-Zvi et al., 2004; Morchid et al., 2014b), that takes into consideration all information contained in a tweet: the content itself, the label and the relation between the distribution of

Summary

Introduction

The exponential growth of available data on the Web enables users to access a large quantity of information. The classical bag-of-words approach (Salton and Buckley, 1988) is usually used for text document representation in the context of keyword extraction. This method estimates the Term Frequency-Inverse Document Frequency (TF-IDF) of the document terms. Other approaches propose to consider the document as a mixture of latent topics to work around segments of errors. These methods build a higher-level representation of the document. A document can thereby belong to different topics from one word to the next. We propose to build a topic model, called author-topic (AT) (Rosen-Zvi et al., 2004; Morchid et al., 2014b), that takes into consideration all information contained in a tweet: the content itself (words), the label (country) and the relation between the distribution of
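
The TF-IDF weighting mentioned above can be illustrated with a minimal sketch. It assumes one common formulation (term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency); Salton and Buckley discuss several weighting variants, and the source does not say which one is used. The function name `tf_idf` and the toy tweets are hypothetical and are not taken from the corpus described here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation: tf = count / document length,
    idf = log(N / number of documents containing the term).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy tweets, for illustration only.
tweets = [
    "je suis charlie paris".split(),
    "charlie hebdo paris attack".split(),
    "solidarity with charlie from london".split(),
]
scores = tf_idf(tweets)
```

In this toy example, "charlie" occurs in every tweet and therefore receives a weight of zero, whereas terms that are specific to one tweet receive the highest weights. This is the behaviour that makes TF-IDF useful for keyword extraction, and it also shows why a purely term-based representation says nothing about latent topics shared across documents.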

