Abstract
Online Arabic content is growing rapidly, but this growth is not matched by growth in structured Arabic resources. Systems that perform standard Natural Language Processing (NLP) tasks such as Named Entity Disambiguation (NED) struggle to deliver decent quality because rich Arabic entity repositories are lacking. In this paper, we introduce EDRAK, an automatically generated, comprehensive, entity-centric Arabic resource. EDRAK contains more than two million entities together with their Arabic names and contextual keyphrases. Manual evaluation confirmed the quality of the generated data. We are making EDRAK publicly available as a resource to help advance research in Arabic NLP and IR tasks such as dictionary-based Named Entity Recognition, entity classification, and entity summarization.
Highlights
Rich structured resources are crucial for several Information Retrieval (IR) and Natural Language Processing (NLP) tasks; resource quality significantly influences the performance of those tasks
We evaluated all aspects of data generation in EDRAK
Summary
Rich structured resources are crucial for several Information Retrieval (IR) and NLP tasks; resource quality significantly influences the performance of those tasks. Building a dictionary-based Named Entity Recognition (NER) system requires a comprehensive and accurate dictionary of names (Darwish, 2013; Shaalan, 2014). Structured Arabic content is lagging behind (footnote 1: www.internetworldstats.com/stats7.htm). Wikipedia is one of the main resources from which many modern Knowledge Bases (KBs) are extracted, and it is heavily used in the literature for IR and NLP tasks. The structured data in the Arabic Wikipedia, such as infoboxes, is on average of lower quality in terms of coverage and accuracy.