Abstract
Social networks have become a major channel for information propagation. On the popular Twitter platform, mobile users post or relay messages from different locations. The content, meaning and location of tweets show how an event, such as the bursty “JeSuisCharlie” event that occurred in France in January 2015, is perceived in different countries. This research aims to cluster tweets according to the co-occurrence of their terms, including the country, and to predict the probable country of a non-located tweet from its content. First, we present the process of collecting a large quantity of data from the Twitter website, which yields a set of 2,189 located tweets about “Charlie”, posted from 7 to 14 January. We then describe an original method adapted from the Author-Topic (AT) model, itself based on Latent Dirichlet Allocation (LDA). We define a homogeneous space containing both lexical content (words) and spatial information (country). During training on part of the sample, the method produces a set of clusters (topics) based on statistical relations between lexical and spatial terms. We then evaluate the method's effectiveness in a clustering task on the rest of the sample, where it reaches up to 95% correct assignments. This shows that, after a learning process, our model is suitable for predicting tweet location.
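As an illustration of the homogeneous space mentioned in the abstract, the following minimal sketch places words and the country label in one shared vocabulary. It is not the authors' implementation: the `country=` prefix, the whitespace tokenizer, the function names and the three example tweets are all assumptions made up for this example, and no topic model is trained here.

```python
from collections import Counter

# Hypothetical marker that keeps country labels distinct from ordinary words.
COUNTRY_PREFIX = "country="


def to_homogeneous_tokens(text, country=None):
    """Return lower-cased words plus, if known, one country pseudo-token."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    if country is not None:
        tokens.append(COUNTRY_PREFIX + country)
    return tokens


def build_vocabulary(token_lists):
    """Map each distinct term (word or country pseudo-token) to an integer id."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_count_vector(tokens, vocab):
    """Count occurrences of each vocabulary term, ordered by term id."""
    counts = Counter(t for t in tokens if t in vocab)
    return [counts.get(term, 0) for term in vocab]


# Made-up tweets: two located, one without a country.
tweets = [
    ("je suis charlie", "FR"),
    ("solidarity with charlie", "US"),
    ("charlie hebdo paris", None),
]
docs = [to_homogeneous_tokens(text, country) for text, country in tweets]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
```

In such a representation, a topic model trained on the count vectors can associate country pseudo-tokens with the words they co-occur with; a non-located tweet simply has zeros in the country positions.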
Highlights
The exponential growth of available data on the Web enables users to access a large quantity of information.
We describe an original method adapted from the Author-Topic (AT) model, itself based on Latent Dirichlet Allocation (LDA).
We propose to build a topic model, called author-topic (AT) (Rosen-Zvi et al., 2004; Morchid et al., 2014b), that takes into consideration all the information contained in a tweet: the content itself, the label and the relation between the distribution of …
Summary
The exponential growth of available data on the Web enables users to access a large quantity of information. The classical bag-of-words approach (Salton and Buckley, 1988) is usually used to represent text documents in the context of keyword extraction. This method estimates the Term Frequency-Inverse Document Frequency (TF-IDF) of the document terms. Other approaches consider the document as a mixture of latent topics to work around segments of errors. These methods build a higher-level representation of the document. Thereby, a document can belong to different topics from one word to another. We propose to build a topic model, called author-topic (AT) (Rosen-Zvi et al., 2004; Morchid et al., 2014b), that takes into consideration all the information contained in a tweet: the content itself (words), the label (country) and the relation between the distribution of …
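The TF-IDF weighting used with the bag-of-words representation can be sketched as follows. This is one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency); several other weightings exist, and the source does not say which one is used. The function name and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tokenized documents.
docs = [
    ["je", "suis", "charlie"],
    ["charlie", "hebdo"],
    ["paris", "charlie"],
]
weights = tf_idf(docs)
```

With this variant, a term that occurs in every document (here "charlie") receives a weight of zero, while terms specific to one document receive the highest weights.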
More From: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences