Abstract
In this paper we present several methods for collecting Web textual content and filtering noisy data. We show that knowing which user publishes which content can contribute to detecting noise. We begin by collecting data from two forums and from Twitter. For the forums, we extract the meaningful information from each discussion (the texts of the question and answers, user IDs, and dates). For the Twitter dataset, we first detect tweets with very similar texts, which helps avoid redundancy in further analysis. This also yields clusters of tweets that can be used in the same way as the forum discussions: both can be modeled by bipartite graphs. The analysis of the nodes of the resulting graphs shows that network structure and content type (noisy or relevant) are not independent, so studying the network structure can help in filtering noise.
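To make the pipeline concrete, the sketch below shows one plausible reading of the two steps the abstract names: grouping tweets with very similar texts into clusters, then modeling clusters (like forum discussions) as user-content bipartite graphs. This is not the authors' implementation; the libraries (scikit-learn, networkx), the TF-IDF/cosine similarity measure, the similarity threshold, and the sample records are all illustrative assumptions.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (user_id, tweet_text) records standing in for the dataset.
tweets = [
    ("u1", "big sale on phones today"),
    ("u2", "big sale on phones today!!"),
    ("u3", "how do I reset my router"),
]

# Step 1: detect tweets with very similar texts.
# Here: pairwise cosine similarity over TF-IDF vectors, with an
# assumed threshold (the paper does not specify the measure used).
texts = [text for _, text in tweets]
sim = cosine_similarity(TfidfVectorizer().fit_transform(texts))

THRESHOLD = 0.8  # assumed value
cluster_of = {}
next_cluster = 0
for i in range(len(tweets)):
    # Join the cluster of the first earlier tweet that is similar enough.
    for j in range(i):
        if sim[i, j] >= THRESHOLD:
            cluster_of[i] = cluster_of[j]
            break
    if i not in cluster_of:
        cluster_of[i] = next_cluster
        next_cluster += 1

# Step 2: model the clusters as a bipartite graph, with users on one
# side and tweet clusters on the other; an edge links a user to every
# cluster they posted in. Forum discussions fit the same scheme, with
# discussions in place of clusters.
G = nx.Graph()
for idx, (user, _) in enumerate(tweets):
    cluster_node = f"cluster_{cluster_of[idx]}"
    G.add_node(user, bipartite="user")
    G.add_node(cluster_node, bipartite="content")
    G.add_edge(user, cluster_node)

# Node-level structure (e.g., degree) can then be compared against
# content labels (noisy vs. relevant) to test their dependence.
print(dict(G.degree()))
```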