Abstract

Automatic aggregation of similar words into semantically related groups (or clusters) is of interest to many natural language processing (NLP) applications. Extracting semantically related words and quasi-synonyms from text is a relatively new research area for the under-resourced Arabic language. Previous attempts addressed the problem of single-word term extraction. However, the absence of a multiword term (MWT) dictionary, ontology, or semantic network makes extracting and identifying high-quality MWTs for the Arabic language a challenging problem. The main goal of this study is to extract corpus-based, coherent MWTs in the form of bigram and trigram sequences as an adequate representation of syntactic and semantic clusters. To that end, this study implements an algorithm named SEWAR, which uses the FastText algorithm to extract high-quality MWTs from an Arabic medical health corpus. If put into practice, SEWAR can provide a suitable and helpful solution for classifying medical health questions and directing them to the proper medical practitioners without human intervention. In addition, SEWAR can be applied to many NLP tasks, such as information retrieval, question answering, and text summarization. Three metrics were used to assess the extracted MWTs: pointwise mutual information (PMI), cosine similarity, and the clustering purity measure. The results are promising and encourage applying SEWAR to extract MWTs from other Arabic corpora.
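The abstract names the three evaluation metrics but does not give their formulas. As a minimal sketch, assuming the standard textbook definitions, they could be computed as follows. The function names (`bigram_pmi`, `cosine_similarity`, `purity`) are hypothetical and chosen for illustration only; this is not SEWAR's implementation, and the probability estimates used for PMI (relative frequencies over a single token list) are an assumption.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Assumption: probabilities are plain relative frequencies.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def purity(cluster_ids, gold_labels):
    """Fraction of items whose gold label is the majority label of their cluster."""
    if not gold_labels:
        return 0.0
    by_cluster = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(gold_labels)
```

In a setting like the one described, the vectors passed to `cosine_similarity` would presumably be word or phrase embeddings (for example, from a FastText model), and `gold_labels` would come from a manually labelled reference set; both are assumptions about usage rather than details stated in the abstract.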
