Web services-based text-mining demonstrates broad impacts for interoperability and process simplification.

Carolyn J Mattingly, Thomas C Wiegers, Allan Peter Davis

doi:10.1093/database/bau050

Open Access

https://doi.org/10.1093/database/bau050

Abstract

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop comprised five independent subject areas, among them Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details: rather than integrating NER tools directly, CTD's asynchronous, batch-oriented text-mining pipeline could make HTTP-based calls to remote NER Web services for recognition of specific biological terms, using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer (REST)/BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions of a second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation.

Database URL: http://ctdbase.org/
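
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how recall, precision and a balanced F-score can be computed for one article. It assumes that a prediction counts as correct when the returned term appears in the manually curated set for that article; the abstract does not state the exact matching rule, so that criterion, the function name and the sample terms are all illustrative assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, balanced F-score) for two collections of terms.

    Assumption: a predicted term is a true positive when it also occurs in the
    curated (gold) set. The matching rule used in the challenge may differ.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    # The balanced F-score is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four terms returned by a service, three curated terms.
p, r, f = precision_recall_f1(
    predicted=["aspirin", "ibuprofen", "caffeine", "ethanol"],
    gold=["aspirin", "ibuprofen", "arsenic"],
)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the balanced F-score is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a common single summary figure for NER output.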

Highlights

The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is a publicly available, manually curated resource that promotes understanding of the mechanisms by which drugs and environmental chemicals influence biological processes and human health [1]
The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org)
The results of Track 3 testing clearly validate the conceptual feasibility of integrating Web service-based NER functionality into asynchronous batch-oriented text-mining pipelines
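
The integration pattern described above (a batch pipeline sending article text to a remote NER Web service and reading back annotations in BioC XML) might look roughly like the sketch below. It is not CTD's or any participant's implementation. The element names follow the general BioC layout (collection, document, passage, annotation, infon), while the function names, the `type` infon key, the content type header and any service URL are assumptions made for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_bioc_request(doc_id, text):
    """Wrap one article's text in a minimal BioC-style collection (as bytes)."""
    collection = ET.Element("collection")
    for tag in ("source", "date", "key"):
        ET.SubElement(collection, tag)  # left empty in this sketch
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    return ET.tostring(collection, encoding="utf-8")


def call_ner_service(url, payload, timeout=120):
    """POST the BioC payload to a remote NER service and return the raw reply.

    `url` is hypothetical; the timeout is generous because the abstract
    reports response times of up to more than a minute per article.
    """
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_annotations(bioc_xml):
    """Extract (entity type, matched text) pairs from a BioC-style reply."""
    root = ET.fromstring(bioc_xml)
    results = []
    for annotation in root.iter("annotation"):
        infons = {i.get("key"): i.text for i in annotation.findall("infon")}
        # Assumption: the entity category is stored under the "type" infon.
        results.append((infons.get("type"), annotation.findtext("text")))
    return results
```

A batch job would call `build_bioc_request`, pass the result to `call_ner_service` for each configured service URL, and feed the reply to `parse_annotations`. Since the pipeline only sees HTTP and XML, the operating system and tool configuration behind each service stay out of its way, which is the interoperability benefit the text describes.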

Summary

Introduction

The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is a publicly available, manually curated resource that promotes understanding of the mechanisms by which drugs and environmental chemicals influence biological processes and human health [1]. CTD’s PhD-level staff biocurators review the scientific literature and manually curate chemical–gene/protein interactions, chemical–disease relationships and gene–disease relationships, using a novel, highly structured notation in conjunction with CTD’s Web-based curation tool [2]. The manual curation process organizes disparate data from scientific publications into a standard structured format, making it more manageable and computable for bioinformatics-related processing. Curated data are captured using publicly available controlled vocabularies. Diseases are represented using CTD’s disease vocabulary, MEDIC [3], which merges OMIM [4] terms with the Disease subset of the National Library of Medicine’s Medical Subject Headings (MeSH) vocabulary [5]; genes/proteins are represented using Entrez Gene terms [6]; chemicals/drugs are represented using a modified subset of Chemicals and Drugs terms within MeSH [5]; and chemical–gene/protein interactions are captured using CTD’s action term vocabulary [1].

Journal: Database : the journal of biological databases and curation	Publication Date: Jun 10, 2014
Citations: 27	License type: cc-by
