Abstract

The rapid growth of data in many areas, including the biomedical domain, calls for information extraction systems that can acquire much-needed knowledge about specific entities such as proteins, drugs, or diseases within a short time. Annotated corpora facilitate the building of NLP systems. While a great deal of work has been done in this area for the English language, other languages such as Arabic lack these resources, especially in the healthcare area. Therefore, in this work, we present a method to develop a silver-standard medical corpus for the Arabic language with a dictionary as a minimal supervision tool. The corpus contains 49,856 sentences tagged with 13 entity types corresponding to a subset of UMLS (Unified Medical Language System) concept types. An evaluation of a subset of the corpus showed the effectiveness of the annotation method, with 90% accuracy.

Highlights

  • With the exponential growth of data in many areas, processing the data and extracting useful information from it becomes a necessity. The biomedical domain is no exception.

  • With more than 30 million citations of biomedical literature in PubMed and a vast volume of electronic health records (EHRs), it is hard for researchers and practitioners in the domain to keep up with the flow of data and extract the knowledge they need within a short time. Therefore, information extraction (IE) systems, in which natural language processing (NLP) techniques are used to turn unstructured text data into readable, well-structured text [1], are needed.

  • Since a considerable portion of biomedical literature is in English, a significant part of NLP systems and existing corpora are dedicated to this language, whereas other languages such as Arabic show a gap in both NLP systems and linguistic resources for the biomedical domain.

Summary

Introduction

With the exponential growth of data in many areas (such as news and economics), processing the data and extracting useful information from it becomes a necessity. This paper presents a method to build an annotated biomedical Arabic corpus. (i) The dictionary itself, without a corpus, can serve as a general medical linguistic resource that can be used to train an Arabic Named Entity (NE) tagger. (ii) This method uses minimal supervision to annotate the corpus, reducing cost, time, and human effort. (iii) The dictionary can be used as a seed to train a minimally supervised classifier for enhanced annotation of a medical Arabic, English, or bilingual corpus, or to enrich a general-purpose corpus with medical annotations.
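
The idea of using a dictionary as a minimal supervision tool can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes whitespace-tokenized sentences, greedy longest-match lookup, and BIO-style output tags. The dictionary entries, the labels `DISEASE` and `DRUG`, and the function name `tag_sentence` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: each key is a tuple of tokens, each value an
# illustrative entity label (not the 13 types used in the paper).
TOY_DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("داء", "السكري"): "DISEASE",  # "diabetes mellitus"
    ("الأنسولين",): "DRUG",        # "insulin"
}


def tag_sentence(tokens: List[str],
                 dictionary: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the dictionary."""
    max_len = max((len(key) for key in dictionary), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = dictionary.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Example sentence: "Diabetes mellitus is treated using insulin."
sentence = "يعالج داء السكري باستخدام الأنسولين".split()
for token, tag in zip(sentence, tag_sentence(sentence, TOY_DICTIONARY)):
    print(token, tag)
```

Exact string matching of this kind misses inflected forms and words with attached prefixes, which are common in Arabic, so a practical pipeline would likely need normalization or stemming before lookup.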

Related Work
Corpus Evaluation and Results
Conclusion and Future Work