Abstract

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, among them Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details: rather than integrating NER tools directly, CTD's asynchronous, batch-oriented text-mining pipeline could make HTTP-based calls to remote NER Web services for recognition of specific biological terms, using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer (REST)/BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions of a second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation.

Database URL: http://ctdbase.org/
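
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how precision, recall and the balanced F-score (the harmonic mean of precision and recall) can be computed for a single article. It assumes that scoring compares the set of term identifiers returned by a Web service against the set of manually curated identifiers; the abstract does not specify the matching rules, so that comparison, the `score_terms` function and the identifiers shown are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NerScores:
    precision: float
    recall: float
    f_score: float


def score_terms(predicted: set[str], curated: set[str]) -> NerScores:
    """Score predicted term identifiers against curated ones (assumed set matching)."""
    true_positives = len(predicted & curated)
    # Precision: fraction of predicted terms that were also curated.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of curated terms that were predicted.
    recall = true_positives / len(curated) if curated else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return NerScores(precision, recall, f_score)


# Hypothetical identifiers, for illustration only.
curated_terms = {"ID:001", "ID:002", "ID:003", "ID:004"}
predicted_terms = {"ID:001", "ID:002", "ID:999"}

scores = score_terms(predicted_terms, curated_terms)
print(f"precision={scores.precision:.2f} "
      f"recall={scores.recall:.2f} "
      f"F={scores.f_score:.2f}")
```

Because the F-score is a harmonic mean, a service that returns many spurious terms (low precision) or misses many curated terms (low recall) is penalized more heavily than a simple average would suggest.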

Highlights

  • The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is a publicly available, manually curated resource that promotes understanding of the mechanisms by which drugs and environmental chemicals influence biological processes and human health [1]

  • The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org)

  • The results of Track 3 testing clearly validate the conceptual feasibility of integrating Web service-based NER functionality into asynchronous batch-oriented text-mining pipelines
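
To make the Web service idea concrete, here is a rough sketch of the two XML-handling steps a pipeline might perform around an HTTP call: wrapping article text in a BioC collection, and reading annotations out of a BioC response. The element names follow the general BioC layout (collection, document, passage, annotation, infon, location), but the function names, the `type` infon key, the placeholder values and the sample response are assumptions for illustration; they are not taken from the Track 3 specification, and the network call itself is omitted.

```python
import xml.etree.ElementTree as ET


def build_bioc_request(doc_id: str, text: str) -> str:
    """Wrap one article's text in a minimal BioC collection (illustrative)."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example-source"  # placeholder
    ET.SubElement(collection, "date").text = "unknown"           # placeholder
    ET.SubElement(collection, "key").text = "example.key"        # placeholder
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    return ET.tostring(collection, encoding="unicode")


def extract_annotations(bioc_xml: str) -> list[tuple[str, str]]:
    """Return (entity type, annotated text) pairs from a BioC response."""
    root = ET.fromstring(bioc_xml)
    results = []
    for annotation in root.iter("annotation"):
        entity_type = ""
        for infon in annotation.findall("infon"):
            # Assumes the entity type is carried in an infon keyed "type".
            if infon.get("key") == "type":
                entity_type = infon.text or ""
        results.append((entity_type, annotation.findtext("text", default="")))
    return results


# Hypothetical response, hand-written for this example.
SAMPLE_RESPONSE = """<collection>
  <source>example-source</source><date>unknown</date><key>example.key</key>
  <document><id>12345</id><passage><offset>0</offset>
    <text>Arsenic alters TP53 expression.</text>
    <annotation id="1"><infon key="type">chemical</infon>
      <location offset="0" length="7"/><text>Arsenic</text></annotation>
    <annotation id="2"><infon key="type">gene</infon>
      <location offset="15" length="4"/><text>TP53</text></annotation>
  </passage></document>
</collection>"""

request_body = build_bioc_request("12345", "Arsenic alters TP53 expression.")
for entity_type, mention in extract_annotations(SAMPLE_RESPONSE):
    print(entity_type, mention)
```

Keeping the XML handling separate from the transport means the same two functions could, in principle, be reused against any remote service that accepts and returns BioC, which is the interoperability benefit the track set out to test.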


Summary

Introduction

The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is a publicly available, manually curated resource that promotes understanding of the mechanisms by which drugs and environmental chemicals influence biological processes and human health [1]. CTD’s PhD-level staff biocurators review the scientific literature and manually curate chemical–gene/protein interactions, chemical–disease relationships and gene–disease relationships, using a novel, highly structured notation in conjunction with CTD’s Web-based curation tool [2]. The manual curation process organizes disparate data from scientific publications into a standard structured format, making it more manageable and computable for bioinformatics-related processing. Curated data are captured using publicly available controlled vocabularies. Diseases are represented using CTD’s disease vocabulary, MEDIC [3], which merges OMIM [4] terms with the Disease subset of the National Library of Medicine’s Medical Subject Headings (MeSH) vocabulary [5]; genes/proteins are represented using Entrez Gene terms [6]; chemicals/drugs are represented using a modified subset of Chemicals and Drugs terms within MeSH [5]; and chemical–gene/protein interactions are captured using CTD’s action term vocabulary [1].

