Keyword Extraction from Arabic Text using the Page Rank Algorithm

S Dhanasekar*

doi:10.35940/ijitee.l2614.1081219

Abstract

This paper describes a method for extracting keywords from Arabic text using the PageRank algorithm. A graph is constructed whose vertices are candidate words taken from the title and the abstract of a given Arabic text after a tagging filter has been applied to that text. Next, a co-occurrence relation is used to draw edges between vertices whose words appear together within specified window sizes. The PageRank algorithm is then applied to the graph to score the importance of each candidate. Finally, the vertices are sorted in descending order of their PageRank scores, and the tokens with the highest scores are chosen as the keywords. Several experiments were conducted on a dataset of 100 Arabic academic articles for training and 50 for testing. The results were evaluated using precision, recall, and the F-measure. The maximum recall attainable on the dataset was 63%, because not all of the manually identified keywords and keyphrases appeared in the article abstracts and titles. The proposed method achieved a recall of 25%. This is considered acceptable because it is comparable to a method in the literature that was applied to an English testing dataset of 500 documents and achieved a recall of 42%, where the maximum attainable recall on that dataset was 78%. Despite the difficulties of searching for keywords in Arabic, and although the Arabic testing dataset contained fewer documents than the English one, it can be concluded that the proposed keyword and keyphrase extraction system based on the PageRank algorithm works well.
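The pipeline summarized in the abstract (candidate filtering, co-occurrence graph, PageRank, sorting) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the `candidates` set standing in for the output of the tagging filter, the interpretation of the window size, the damping factor of 0.85, and the fixed iteration count are all assumptions made for illustration.

```python
def build_cooccurrence_graph(tokens, candidates, window=2):
    """Build an undirected graph over candidate words.

    Assumption: two candidates are linked when they occur within
    `window` consecutive tokens of each other. `candidates` stands in
    for the words that survive the tagging filter.
    """
    graph = {}
    for i, word in enumerate(tokens):
        if word not in candidates:
            continue
        graph.setdefault(word, set())  # keep isolated candidates as vertices
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other in candidates and other != word:
                graph[word].add(other)
                graph.setdefault(other, set()).add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Score vertices by power iteration on an undirected graph.

    Simplification: isolated vertices receive only the teleport term,
    so scores need not sum to 1; only the ranking is used here.
    """
    n = len(graph)
    if n == 0:
        return {}
    scores = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new_scores = {}
        for v in graph:
            # Each neighbour u shares its score equally among its own neighbours.
            incoming = sum(scores[u] / len(graph[u]) for u in graph[v])
            new_scores[v] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores


def extract_keywords(tokens, candidates, window=2, top_k=5):
    """Return the top_k candidates sorted by descending PageRank score."""
    graph = build_cooccurrence_graph(tokens, candidates, window)
    scores = pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

Here `tokens` would be the tokenized title and abstract of one article, and `top_k` controls how many of the highest-scoring tokens are kept as keywords. Merging adjacent keywords into keyphrases is not covered by this sketch.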
