Abstract
Multi-word term extraction plays an important role in many Natural Language Processing (NLP) tasks. Despite its importance, few works have been dedicated to Arabic multi-word term extraction. This paper proposes an automatic Arabic multi-word terms (MWTs) extraction system based on two major filtering steps: a linguistic filter using a part-of-speech tagger along with morphological patterns, and a statistical filter based on probabilistic methods, namely the Log-Likelihood Ratio (LLR) and the C-value. We evaluate the performance of the implemented system on Wattan, a topic-oriented Arabic newspaper corpus. Our system achieves a precision of 90.23% in multi-word term extraction. We also study the use of MWTs as features in Arabic topic detection. The conducted experiments show good results.
Highlights
The increasing availability of Arabic electronic documents has led to extensive research efforts across the various fields of Arabic Natural Language Processing (ANLP), taking into consideration the particularities and complex morphological composition of the Arabic language.
This method works in three steps: first, the selection of the first n multi-word terms (MWTs) from the list of extracted MWTs, sorted according to their scores obtained using the Log-Likelihood Ratio (LLR) and the C-value.
MWT extraction system: the multi-word term extraction system extracts terms composed of 2 to 6 words.
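As an illustration of ranking candidates by LLR and keeping the first n, here is a minimal Python sketch. It uses the standard log-likelihood ratio over a 2x2 contingency table of a two-word candidate; the function names `llr` and `top_n` and the count layout are assumptions made for this example, not the authors' implementation, and the source does not say how the LLR and C-value scores are combined.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of a word pair (w1, w2).

    k11: occurrences of w1 followed by w2
    k12: occurrences of w1 followed by another word
    k21: occurrences of another word followed by w2
    k22: occurrences of pairs containing neither w1 first nor w2 second
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score

def top_n(scored, n):
    """Return the n highest-scoring (candidate, score) pairs (hypothetical helper)."""
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

A pair whose words co-occur exactly as often as chance predicts scores 0; the more strongly associated the words, the higher the score.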
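To make the 2 to 6 word range concrete, the sketch below enumerates every run of consecutive tokens within those length bounds. It is only an assumed starting point: in the described system, candidates would additionally have to pass the part-of-speech and morphological pattern filter, which is not shown here. The name `candidate_ngrams` and the placeholder tokens are hypothetical.

```python
from collections import Counter

def candidate_ngrams(tokens, min_len=2, max_len=6):
    """Yield every sequence of consecutive tokens with min_len to max_len words."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# Placeholder tokens standing in for a tokenized Arabic sentence.
tokens = ["w1", "w2", "w3", "w4"]
counts = Counter(candidate_ngrams(tokens))
```

Counting the candidates over a whole corpus gives the frequencies that the statistical filter needs.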
Summary
The increasing availability of Arabic electronic documents has led to extensive research efforts across the various fields of Arabic Natural Language Processing (ANLP), taking into consideration the particularities and complex morphological composition of the Arabic language. Little research has been undertaken in the field of multi-word term extraction for Arabic documents. The MWTs extraction task covers the detection and extraction of consecutive sets of semantically related words. The techniques used in MWTs extraction can be classified into four categories, including: statistical approaches based on frequency, probability and co-occurrence measures [7]; approaches based on morphological analysis, MWT boundary detection and patterns [8]; and hybrid approaches combining statistical and morphological methods [9][10]. The hybrid approaches are widely used since they combine the benefits of statistical and symbolic methods. Our work is part of the semantic processing of unvowelized Arabic documents and aims to develop a multi-word term extraction prototype for Arabic texts based on the hybrid approach, using lexical patterns and two statistical measures: the C-value and the LLR.
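For readers unfamiliar with the C-value, the following Python sketch applies its usual definition: a candidate's frequency is weighted by the base-2 logarithm of its length, after subtracting the average frequency of the longer candidates that contain it. This is a simplified illustration that uses raw frequencies for the nesting terms; the function names and the toy frequencies are assumptions, not taken from the paper.

```python
import math

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(freq):
    """Compute C-value scores.

    freq maps candidate terms (tuples of at least two words) to their frequency.
    """
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidates in which `a` is nested.
        nesting = [f_b for b, f_b in freq.items()
                   if len(b) > len(a) and contains(b, a)]
        if nesting:
            f_a = f_a - sum(nesting) / len(nesting)
        scores[a] = math.log2(len(a)) * f_a
    return scores

# Toy frequencies with placeholder words (not real corpus data).
freq = {("a", "b"): 5, ("a", "b", "c"): 2}
scores = c_value(freq)
```

In this toy case the two-word candidate is penalized because two of its five occurrences are explained by the longer term that contains it.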
More From: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY