Abstract

Extracting named entities is an important step in information extraction from text based on a given ontology. Working with Arabic raises additional challenges compared with English, French, and other languages in similar families. The major difficulties are its complex morphology, the absence of capitalization, and the lack of standardized Arabic spelling. Arabic has a rich and complex morphology because it is highly inflected: a single lemma can typically surface in many forms through changes to its internal structure and through prefixes and suffixes. Furthermore, Arabic writing is not standardized, so the same word is often spelled inconsistently. In this work, we propose an operational hybrid approach that combines dictionary-based and rule-based detection to extract seven categories of named entities: organization name, date, interval, price/value, percentage, currency, and unit. The dictionary-based component performs exact or approximate matching of words against a prepared dictionary of Arabic organization names. When a word does not match a dictionary entry exactly, approximate matching is an efficient way to handle the morphological difficulties. Specificities of the Arabic language are also handled by the rule-based component, which captures entity patterns as regular expressions or as patterns provided by experts. We evaluated our Arabic named entity recognition system on financial news articles and obtained a recognition rate of around 80%.
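
The hybrid idea described above can be illustrated with a minimal Python sketch. It is not the system evaluated in this work. The normalization rules, the two-entry dictionary `ORGANIZATIONS`, the 0.85 threshold, the function names, and the percentage pattern are all assumptions made for illustration. The abstract does not name a similarity measure, so the standard-library `difflib.SequenceMatcher` ratio stands in for approximate matching.

```python
import re
from difflib import SequenceMatcher

# Assumed normalization rules (not taken from the paper): they reduce
# common spelling inconsistencies before any dictionary lookup.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # short vowels and tatweel
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda/hamza

def normalize(text: str) -> str:
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)  # unify to bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    return text

# Hypothetical dictionary of organization names.
ORGANIZATIONS = ["البنك المركزي", "شركة الاتصالات"]

def match_organization(candidate: str, threshold: float = 0.85):
    """Return (dictionary entry, score), or None if nothing is close enough."""
    cand = normalize(candidate)
    best_name, best_score = None, 0.0
    for name in ORGANIZATIONS:
        norm = normalize(name)
        if cand == norm:  # exact match after normalization
            return name, 1.0
        score = SequenceMatcher(None, cand, norm).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None

# Assumed rule for one category (percentage): Western or Arabic-Indic
# digits, an optional decimal part, then '%' or the Arabic percent sign.
_DIGIT = r"[0-9\u0660-\u0669]"
PERCENT = re.compile(_DIGIT + r"+(?:[.,\u066B]" + _DIGIT + r"+)?\s*(?:%|\u066A)")

def find_percentages(text: str) -> list:
    return PERCENT.findall(text)
```

In this sketch, a candidate carrying the conjunction prefix, such as `"والبنك المركزي"`, is still resolved to the dictionary entry `"البنك المركزي"` through approximate matching, while an unrelated phrase falls below the threshold and returns `None`. A real system would need a tuned threshold and a rule set for each of the seven categories.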
