Abstract

Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once extracted automatically, they can improve the performance of Arabic text mining applications such as categorization, clustering, information retrieval, machine translation, and summarization. Existing methods for AMWT extraction fall into three approaches: linguistic, statistical, and hybrid. These methods have drawbacks that limit their use: they can handle only bi-gram terms, and they do not achieve good accuracy. To overcome these drawbacks, this paper proposes a new and efficient method for AMWT extraction based on a hybrid approach. The method consists of two main filtering steps: a linguistic filter and a statistical filter. The linguistic filter uses our proposed Part-Of-Speech (POS) tagger and the sequence identifier as patterns to extract candidate AMWTs. The statistical filter incorporates contextual information and a newly proposed association measure, named NTC-Value, based on termhood and unithood estimation. To evaluate and illustrate the efficiency of the proposed method, we conducted a comparative study on the Kalimat Corpus using nine experimental schemes. In the linguistic filter, we used three POS taggers: Taani’s rule-based method, an HMM-based statistical method, and our recently proposed hybrid tagger. In the statistical filter, we used three statistical measures: C-Value, NC-Value, and our proposed NTC-Value. The results demonstrate the efficiency of the proposed method for AMWT extraction: it outperforms the other schemes and correctly handles tri-gram terms.
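The two-step pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the POS tag names, the tag patterns, and the transliterated example words are hypothetical, and the statistical step uses a simplified form of the standard C-Value (one of the baseline measures named above), because the abstract does not define NTC-Value. The sketch assumes the input is already tokenized and POS-tagged.

```python
import math
from collections import Counter

# Hypothetical POS tag patterns for candidate terms (illustrative only,
# not the patterns used in the paper).
PATTERNS = {("NOUN", "NOUN"), ("NOUN", "ADJ"), ("NOUN", "NOUN", "ADJ")}


def extract_candidates(tagged_sentences, patterns=PATTERNS):
    """Linguistic filter: count every n-gram whose tag sequence matches a pattern."""
    counts = Counter()
    lengths = sorted({len(p) for p in patterns})
    for sent in tagged_sentences:
        for n in lengths:
            for i in range(len(sent) - n + 1):
                window = sent[i:i + n]
                tags = tuple(tag for _, tag in window)
                if tags in patterns:
                    counts[tuple(word for word, _ in window)] += 1
    return counts


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of a longer term."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(counts):
    """Statistical filter: simplified C-Value over the candidate counts.

    For a candidate a with frequency f(a) and length |a| in words:
      not nested: log2(|a|) * f(a)
      nested:     log2(|a|) * (f(a) - mean frequency of the longer
                  candidates that contain a)
    """
    scores = {}
    for term, freq in counts.items():
        containers = [f for other, f in counts.items() if is_nested(term, other)]
        if containers:
            freq = freq - sum(containers) / len(containers)
        scores[term] = math.log2(len(term)) * freq
    return scores


# Usage with a toy, transliterated, pre-tagged corpus (hypothetical data).
corpus = [
    [("nizam", "NOUN"), ("tashghil", "NOUN"), ("jadid", "ADJ")],
    [("nizam", "NOUN"), ("tashghil", "NOUN")],
    [("nizam", "NOUN"), ("tashghil", "NOUN")],
]
candidates = extract_candidates(corpus)
ranking = sorted(c_value(candidates).items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranking:
    print(" ".join(term), round(score, 3))
```

Because the patterns cover both two-tag and three-tag sequences, the same code scores bi-gram and tri-gram candidates together; the nesting correction lowers the score of a bi-gram that mostly appears inside a longer candidate.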

