Abstract

Because bloggers are allowed to annotate their posts with tags, tags have become one of the most important resources for describing blogger features. However, tag quality is uneven, and not all tags are appropriate for representing a blogger's preferences. Poor tags and spam tags make it hard to tell a user's actual preferences apart from spam terms, so they should be detected before tags are used to describe bloggers. A detailed quantitative analysis of the categories of tag spam in the blogosphere is presented in this paper. Taking advantage of the abundant text content of blog posts and the relatively stable semantic relationship between tags and their target posts, an unsupervised approach based on topic models is proposed to evaluate tag quality for blogger modeling in the blogosphere. The latent interest topics of a blogger are mined with Latent Dirichlet Allocation (LDA) topic modeling. Each blog post of the blogger is represented as a distribution over latent topics, and each latent topic is a distribution over the words of the vocabulary. Each tag is expressed as a co-occurrence term vector. Finally, a scheme is devised to measure the similarity between each tag and its target blog post, and tags with low similarity values are identified as poor tags. Experimental results on datasets collected from Sina Blog, one of the largest Chinese blog platforms, indicate that the proposed method performs better than the baselines.
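
The tag-to-post comparison described above can be illustrated with a minimal sketch. The abstract does not specify the similarity scheme, so everything here is an assumption made for illustration: the post is projected into word space as p(w|d) = sum_k theta[k] * phi[k][w], each tag is a raw co-occurrence count vector over the same vocabulary, cosine similarity is used as the score, and the threshold, function names, and toy numbers are all hypothetical. In practice theta and phi would come from a trained LDA model rather than being written by hand.

```python
import math


def post_word_distribution(theta, phi):
    """Mix the topic-word distributions by the post's topic weights.

    theta[k]  : weight of topic k in the post (assumed to come from LDA)
    phi[k][w] : probability of word w under topic k (assumed to come from LDA)
    """
    vocab_size = len(phi[0])
    return [
        sum(theta[k] * phi[k][w] for k in range(len(theta)))
        for w in range(vocab_size)
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_poor_tags(tag_vectors, theta, phi, threshold):
    """Score each tag against the post and list tags below the threshold.

    The cosine measure and the fixed threshold are illustrative choices,
    not the scheme used in the paper.
    """
    post_vec = post_word_distribution(theta, phi)
    scores = {tag: cosine(vec, post_vec) for tag, vec in tag_vectors.items()}
    poor = [tag for tag, score in scores.items() if score < threshold]
    return scores, poor


# Toy vocabulary and hand-written distributions (hypothetical data).
vocab = ["travel", "hotel", "flight", "loan", "casino"]
phi = [
    [0.4, 0.3, 0.3, 0.0, 0.0],  # topic 0: travel-related words
    [0.0, 0.0, 0.0, 0.5, 0.5],  # topic 1: spam-like words
]
theta = [0.9, 0.1]  # the post is mostly about topic 0

# Co-occurrence counts of each tag with the vocabulary words.
tag_vectors = {
    "tourism": [5, 3, 4, 0, 0],
    "free-money": [0, 0, 0, 6, 7],
}

scores, poor = flag_poor_tags(tag_vectors, theta, phi, threshold=0.5)
print(scores)
print(poor)
```

With these toy values, the "tourism" tag scores close to 1 against the travel-dominated post, while "free-money" falls well below the threshold and is flagged.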

