Abstract

Twitter is a micro-blogging social networking service where users post millions of short messages every day. Building multilingual corpora from this microblog content can be useful for several computational tasks such as opinion mining. However, gathering Twitter data raises the problem of irrelevant included data. Recent studies have shown that topic models such as Latent Dirichlet Allocation (LDA) are not consistent when applied to short texts like tweets. In order to prune irrelevant tweets, we investigate in this paper a novel method to improve topics learned from Twitter content without modifying the basic machinery of LDA. The method is based on a pooling process that combines an information retrieval (IR) approach with LDA: an IR-based aggregation strategy retrieves similar tweets into the same cluster, and the resulting tweet pools are then used as input to a standard LDA in order to overcome the sparsity of Twitter content. Empirical results show that tweet aggregation based on IR and LDA leads to a notable improvement across a variety of topic coherence measures, compared with an unmodified LDA baseline and several alternative pooling schemes.

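To make the pooling idea concrete, the following is a minimal Python sketch, assuming TF-IDF vectors with cosine similarity as the IR component (scikit-learn) and gensim's standard LdaModel as the unmodified topic model; the greedy assignment heuristic, the 0.3 similarity threshold, and the pool_tweets helper are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: IR-based tweet pooling followed by a standard (unmodified) LDA.
# Assumptions: TF-IDF + cosine similarity as the IR step, a greedy
# single-pass clustering with an arbitrary threshold, gensim for LDA.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim import corpora
from gensim.models import LdaModel

def pool_tweets(tweets, threshold=0.3):
    """Greedily assign each tweet to its most similar pool (IR step)."""
    vectors = TfidfVectorizer().fit_transform(tweets)
    pools = []  # each pool is a list of tweet indices
    for i in range(len(tweets)):
        best, best_sim = None, 0.0
        for p, members in enumerate(pools):
            # Compare against the pool's first tweet as its representative.
            sim = cosine_similarity(vectors[i], vectors[members[0]])[0, 0]
            if sim > best_sim:
                best, best_sim = p, sim
        if best is not None and best_sim >= threshold:
            pools[best].append(i)
        else:
            pools.append([i])
    # Each pool becomes one longer pseudo-document for LDA.
    return [" ".join(tweets[i] for i in members) for members in pools]

tweets = [
    "lda topic models for short text",
    "topic models struggle on short tweets",
    "great coffee this morning",
]
pooled_docs = pool_tweets(tweets)

# Standard LDA trained on the pooled pseudo-documents, not on raw tweets.
texts = [doc.split() for doc in pooled_docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())
```

The design point the sketch illustrates is that only the input changes: related tweets are merged into longer pseudo-documents before training, so the LDA machinery itself is left untouched while the sparsity of individual tweets is reduced.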