Microblog Keyphrase Extraction Based on Similarity Features

He Yan Huang, Li Zi Liao

doi:10.2991/icacsei.2013.71

Abstract

This paper proposes extracting keyphrases from microblogs based on similarity features. We analyze a large number of microblogs and find an interesting phenomenon: people use various nugget phrases to express the same factoid, and many of these nugget phrases are similar to one another. We propose a context-sensitive topical PageRank method based on similarity features for keyphrase ranking. We evaluate the proposed method on a large microblog dataset. Experiments show that our system is very effective for keyphrase extraction.

Index Terms - Keyphrase extraction, Similarity features, Graph-based model.

I. Introduction

As a medium for broadcasting short informal messages, microblogging has rapidly become popular. Although microblogs are noisy and informal, they provide a unique compilation of first-hand information about people's opinions and feelings. Analyzing such a large volume of up-to-date information can be helpful in domains such as public opinion monitoring and crisis public relations. Because keyphrases summarize microblog content efficiently, keyphrase extraction from microblogs is starting to receive more attention.

Extracting a very small number of representative keyphrases from a large microblog set is challenging. Because microblogs are numerous, fast-changing, and noisy, an unsupervised approach seems more appropriate for analyzing them. Supervised methods need a set of microblogs with human-assigned keyphrases as training data. Microblogs grow exponentially in number and change rapidly, so it is impractical to have humans relabel a training dataset from time to time to keep up. We therefore propose an unsupervised approach in this study.

Most existing keyphrase extraction algorithms focus on popular formal domains, such as papers or web pages. When applied to microblogs, an extremely informal domain, their performance drops sharply. Compared with traditional text collections, keyphrase extraction from microblogs is more challenging in several respects. First, microblogs are much shorter than traditional texts, and not all microblogs contain useful information. Second, microblogs are written by a wide variety of users, so microblogs about the same event, or even microblogs with the same meaning, may be expressed in totally different forms.

In this paper, we analyze a large number of microblogs and find an interesting phenomenon: when an event or topic occurs, people tend to use various nugget phrases to refer to it. Therefore, widely used features such as position, term frequency, and TF-IDF may not be very effective for keyphrase extraction. At the same time, we find that although the forms of the phrases differ, their contexts or head nouns are somewhat similar. We propose two kinds of features to capture this phenomenon. One is the context similarity of candidate phrases. The other is inner similarity, which measures the similarity of head nouns.

Keyphrase extraction has three standard steps: keyword ranking, candidate keyphrase generation, and keyphrase ranking. When this pipeline is applied to noisy microblogs, its performance suffers. We propose to rank candidate phrases directly after a preprocessing procedure. We modify the context-sensitive topical PageRank method by introducing similarity features (1). We find that preprocessing keyphrases using similarity features before ranking largely helps boost performance.

Section 2 surveys related literature on keyphrase extraction. Section 3 describes the phenomenon we observe and outlines our proposed extraction system. Section 4 presents experiments on the effectiveness of our system compared with baselines.
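
To make the two similarity features in the introduction concrete, the following minimal Python sketch ranks candidate phrases with a plain PageRank over a graph whose edge weights mix context similarity and inner (head-noun) similarity. It is an illustration under stated assumptions, not the method evaluated in this paper: the function names, the mixing weight `alpha`, the use of the last token as the head noun, exact-match head comparison, and the uniform teleport vector (in place of a topical one) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def context_similarity(ctx_a, ctx_b):
    """Cosine similarity between two bags of context words."""
    a, b = Counter(ctx_a), Counter(ctx_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inner_similarity(phrase_a, phrase_b):
    """Stand-in head-noun similarity (assumption): the last token is
    treated as the head noun, and only an exact match counts."""
    head_a, head_b = phrase_a.split()[-1], phrase_b.split()[-1]
    return 1.0 if head_a == head_b else 0.0


def rank_phrases(contexts, alpha=0.5, damping=0.85, iterations=50):
    """Rank candidate phrases by PageRank on a similarity-weighted graph.

    contexts maps each non-empty candidate phrase to the words seen around it.
    alpha is a hypothetical weight mixing the two similarity features.
    """
    phrases = sorted(contexts)
    n = len(phrases)
    # Edge weight between every pair of distinct phrases.
    weight = {
        (p, q): alpha * context_similarity(contexts[p], contexts[q])
        + (1 - alpha) * inner_similarity(p, q)
        for p in phrases for q in phrases if p != q
    }
    out_sum = {p: sum(weight[(p, q)] for q in phrases if q != p) for p in phrases}
    score = {p: 1.0 / n for p in phrases}
    for _ in range(iterations):
        new_score = {}
        for q in phrases:
            # Phrases with no outgoing weight pass nothing on, so scores
            # need not sum to one in this simplified version.
            incoming = sum(
                score[p] * weight[(p, q)] / out_sum[p]
                for p in phrases if p != q and out_sum[p] > 0
            )
            new_score[q] = (1 - damping) / n + damping * incoming
        score = new_score
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example = {
        "earthquake in Sichuan": ["rescue", "victims", "magnitude"],
        "Sichuan quake": ["rescue", "victims", "aftershock"],
        "the big quake": ["rescue", "magnitude", "aftershock"],
        "my lunch": ["noodles", "tasty"],
    }
    for phrase, value in rank_phrases(example):
        print(f"{value:.4f}  {phrase}")
```

In this toy input, the three phrases that refer to the same event reinforce one another through shared context words and a shared head noun, while the unrelated phrase receives only the teleport share. A topical variant would replace the uniform teleport term with a topic-dependent preference.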
