A New Method for Extracting Key Terms from Micro-Blogs Messages Using Wikipedia

Ahmad Ali Al-Zubi

doi:10.19026/rjaset.6.3512

Abstract

This study describes how to extract key terms from micro-blog messages using information obtained by analysing the structure and content of the online encyclopaedia Wikipedia. The algorithm is based on calculating the keyphraseness of each term, i.e., an estimate of the probability that the term would be chosen as a key term in the text. In evaluation, the developed algorithm showed satisfactory results on this task, significantly outperforming other existing algorithms. To demonstrate a possible application, the algorithm has been implemented in a prototype contextual advertising system. Some options have also been formulated for using the information obtained by analysing Twitter messages in various support services.

Highlights

Key term extraction is vital for Knowledge Management Systems, Information Retrieval Systems and Digital Libraries, as well as for general browsing of the web
Micro-blogs are a conceptual development of blogs, driven by their broad socialization, and have certain characteristics: a limited message length, high frequency of publication, various topics, different ways of delivering messages, etc
Any application based on the values of the weights of terms in the document will be affected. This largely precludes the use of key term extraction methods that require learning in systems where dynamic data streams must be processed in real time. Notation: i = ordinal number of the term; TFi = frequency of the term in the message; Ki = keyphraseness of the term in Wikipedia

Summary

INTRODUCTION

Key term extraction is vital for Knowledge Management Systems, Information Retrieval Systems and Digital Libraries, as well as for general browsing of the web. Common models for automating key term extraction usually rely on statistics-based methods such as Bayesian, K-Nearest Neighbor and Expectation-Maximization. These models are limited in the word-related features they can use, since adding more features makes the models more complex and difficult to comprehend (Arnulfo et al., 2012). It should be noted that the classical statistical methods for the extraction of key terms, based on the analysis of document collections, are ineffective for micro-blog messages (Al-Zubi, 2010). This is due to the extremely small length of messages (up to 140 characters), their wide range of themes and the lack of logical connection between them, as well as an abundance of rarely used abbreviations, acronyms and elements of specific micro-syntax. A number of heuristics are applied to the analyzed set of terms, and the result is a list of terms judged to be key.
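The idea of weighting terms by keyphraseness can be illustrated with a minimal sketch. It is not the implementation from this study. It assumes that keyphraseness is estimated as the share of Wikipedia articles containing a term in which that term appears as a link, and that a term's weight is its frequency in the message multiplied by its keyphraseness. Both the combination rule and the threshold heuristic are assumptions made for illustration, and the counts in `WIKI_COUNTS` are invented placeholders rather than real Wikipedia statistics.

```python
from collections import Counter

# Hypothetical counts (placeholders, not real Wikipedia data):
# term -> (articles where the term is a link, articles containing the term)
WIKI_COUNTS = {
    "wikipedia": (900, 1000),
    "advertisement": (120, 400),
    "new": (5, 5000),
}


def keyphraseness(term, counts=WIKI_COUNTS):
    """Estimate the probability that a term is chosen as a key term.

    Assumed estimator: linked occurrences / all occurrences.
    Terms missing from the counts get a keyphraseness of 0.0.
    """
    linked, total = counts.get(term, (0, 0))
    return linked / total if total else 0.0


def rank_terms(message, threshold=0.1):
    """Rank the terms of one message by TF * keyphraseness (assumed rule).

    Tokenization is a plain whitespace split on the lowercased message;
    a real system would also handle punctuation and micro-syntax.
    Terms scoring below the (hypothetical) threshold are dropped.
    """
    tf = Counter(message.lower().split())
    scored = ((term, freq * keyphraseness(term)) for term, freq in tf.items())
    kept = [(term, score) for term, score in scored if score >= threshold]
    # Highest score first; ties are broken alphabetically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for term, score in rank_terms("new wikipedia advertisement wikipedia"):
        print(term, round(score, 3))
```

Because the counts are precomputed, scoring a message needs only dictionary lookups and no training step, which is the property that matters when a stream of short messages has to be processed as it arrives.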

MATERIALS AND METHODS

RESULTS AND DISCUSSION

CONCLUSION