Abstract

Online Arabic content is growing very rapidly, but the growth of structured Arabic resources has not kept pace. Systems that perform standard Natural Language Processing (NLP) tasks such as Named Entity Disambiguation (NED) struggle to deliver decent quality because rich Arabic entity repositories are lacking. In this paper, we introduce EDRAK, an automatically generated, comprehensive, entity-centric Arabic resource. EDRAK contains more than two million entities together with their Arabic names and contextual keyphrases. Manual evaluation confirmed the quality of the generated data. We are making EDRAK publicly available as a resource to help advance research in Arabic NLP and Information Retrieval (IR) tasks such as dictionary-based Named Entity Recognition, entity classification, and entity summarization.

Highlights

  • Rich structured resources are crucial for several Information Retrieval (IR) and NLP tasks; resource quality significantly influences the performance of those tasks

  • We evaluated all aspects of data generation in EDRAK

Summary

Introduction

Rich structured resources are crucial for several Information Retrieval (IR) and NLP tasks; resource quality significantly influences the performance of those tasks. For example, building a dictionary-based Named Entity Recognition (NER) system requires a comprehensive and accurate dictionary of names (Darwish, 2013; Shaalan, 2014). Although online Arabic content is growing rapidly (www.internetworldstats.com/stats7.htm), structured Arabic content is lagging behind. Wikipedia is one of the main resources from which many modern Knowledge Bases (KBs) are extracted, and it is heavily used in the literature for IR and NLP tasks. The structured data in the Arabic Wikipedia, such as infoboxes, are on average of lower quality in terms of coverage and accuracy.
