Abstract

Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus, called BC5CDR, during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for diseases and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.

Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
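The abstract reports inter-annotator agreement using the Jaccard similarity coefficient. As a rough illustration only, the sketch below computes that coefficient over two annotators' entity sets. The representation of an annotation as a (start, end, concept ID) tuple, the function name `jaccard`, and the sample offsets and identifiers are assumptions made for this example; the abstract does not describe how the reported scores were actually computed.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two annotation sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets agree fully.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical annotations for one article, each given as
# (start offset, end offset, concept identifier). Values are illustrative only.
annotator_1 = {(0, 8, "D009270"), (25, 36, "D007022"), (50, 58, "D003000")}
annotator_2 = {(0, 8, "D009270"), (25, 36, "D007022"), (60, 70, "D006973")}

# Two annotations are shared and four are distinct overall.
score = jaccard(annotator_1, annotator_2)
print(f"Jaccard agreement: {score:.2%}")
```

Under this exact-match assumption, an annotation counts as agreed only when both the span and the concept identifier coincide; a looser criterion (for example, overlapping spans) would yield higher scores.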

Highlights

  • Relations between chemicals and diseases (Chemical-Disease Relations, or CDRs) play critical roles in drug discovery, biocuration, drug safety, etc. [1]

  • To ensure we have some unseen data for the task participants, the remaining 100 articles of the test set were annotated during the challenge and their curation was not made public until the BioCreative V challenge was complete

  • We developed a corpus for both named entity recognition and chemical-disease relations in the literature


Summary

Introduction

Relations between chemicals and diseases (Chemical-Disease Relations, or CDRs) play critical roles in drug discovery, biocuration, drug safety, etc. [1]. Due to the high cost of manual curation and the rapid growth of the biomedical literature, several attempts have been made to assist curation using text-mining systems [4,5], including the automatic extraction of CDRs [6]. These attempts have met with limited success, due in part to the lack of a large-scale training corpus. The BioCreative V challenge included two subtasks: a disease named entity recognition (DNER) task and a chemical-induced disease (CID) relation extraction task.

