Harnessing online social media to deal with information overload.

Chenliang Li

doi:10.32657/10356/54827

Abstract

In online social media, users are no longer just information consumers, as they were a decade ago; they have become information creators and disseminators through active interplay with information items and other users. This collaborative, active mode of information production and dissemination further aggravates the problem of information overload on the World Wide Web (WWW). Existing approaches to information retrieval (IR) and natural language processing (NLP) tasks often have response times that Web users find intolerable. Moreover, given the numerous interactions between users and information items, new kinds of information needs are emerging, such as opinion mining and event detection and summarization. However, existing IR technologies (based on the bag-of-words model) and NLP technologies (based on linguistic features) often fail to satisfy these emerging needs of Web users. On the other hand, people participate in online social media to share stories and photos with their friends, vote and leave opinions, tag web pages, and so on. The digital footprints of these behaviors make online social media a semantic resource that we can exploit to better understand and organize this astronomical amount of information. In this dissertation, we first analyze online social media as a multi-dimensional social network, taking Wikipedia as a case study. We find that, given the multiple relations exposed from different perspectives in the network, focusing on only one specific relation could lead to biased or even wrong conclusions. Traditional information retrieval approaches are mainly keyword-based and rely on the bag-of-words model, which ignores word order in the text and measures relevance by the presence of keywords. We propose a generalized framework for word sense disambiguation based on Wikipedia. 
The proposed framework enables effective and efficient disambiguation by relating keyphrases (i.e., n-grams) in documents to their appropriate concepts in Wikipedia, where a concept is defined as a Wikipedia article. The framework is applicable to documents in different languages and under different settings. By adopting the disambiguation method, we can represent a textual document by the Wikipedia concepts it covers. Based on this concept model, we study the semantic tag recommendation task for web pages by exploring the semantic relations between tags and concepts that underlie human annotation activities. Web users participate in the information generation process by commenting on news articles, sharing stories, publishing opinions in microblog posts, etc. However, user-generated content (e.g., comments, tweets) is often short and written in a free style, containing grammatical errors and informal abbreviations. These adverse features degrade the performance of existing algorithms on many online social media tasks, such as named entity recognition and event detection. We propose an unsupervised approach for named entity recognition in targeted…
