Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus

D C Comeau, H Liu, R Islamaj Doğan, W J Wilbur

doi:10.1093/database/bau056

Abstract

BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated, along with other BioC programs, into any BioC-compliant text mining system. As an application, we converted the NCBI disease corpus to BioC format and ran the pipelines on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net.

Database URL: http://bioc.sourceforge.net

Highlights

The BioCreative IV Interoperability track [1] addressed the goal of interoperability—a major barrier for wide-scale adoption of available text mining tools
The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets
We ran the BioC natural language processing (NLP) tools on the NCBI disease corpus and compared the outputs obtained in their native implementation versus the outputs obtained using their BioC-compatible counterparts

Summary

Introduction

The BioCreative IV Interoperability track [1] addressed interoperability, the lack of which is a major barrier to wide-scale adoption of available text mining tools. It is straightforward to incorporate BioC code into existing programs so that they read data from BioC-formatted input files and write results to BioC-formatted output files. As part of this track, the community contributed BioC-formatted data sets and BioC-compliant tools for various useful biomedical natural language processing (NLP) tasks. Our contributions to the interoperability track of the BioCreative IV challenge are BioC text-preprocessing pipelines in C++ and Java. Text preprocessing is integral to virtually all NLP systems. It processes the original text into meaningful units that contain important linguistic features.
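To make the read-annotate-write pattern concrete, here is a minimal sketch in Python using only the standard library. It is not the authors' C++ or Java pipeline and does not call MedPost or the Stanford tools: the regex-based `split_sentences` and `tokenize` helpers are hypothetical stand-ins for a real segmenter and tokenizer. The element names (`collection`, `document`, `passage`, `sentence`, `annotation`, `infon`, `location`) follow the BioC XML layout as we understand it, and should be checked against the BioC DTD before relying on them. Offsets are assumed to be relative to the document, so each token location adds the passage offset.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive stand-in for a real segmenter: return (offset, sentence) pairs."""
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = m.group()
        lead = len(raw) - len(raw.lstrip())  # skip leading whitespace
        sent = raw.strip()
        if sent:
            spans.append((m.start() + lead, sent))
    return spans


def tokenize(sentence):
    """Naive stand-in for a real tokenizer: return (offset, token) pairs."""
    return [(m.start(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", sentence)]


def annotate_passage(passage):
    """Replace a passage's text with sentence elements carrying token annotations."""
    text_el = passage.find("text")
    if text_el is None:
        return passage
    base = int(passage.findtext("offset", default="0"))
    text = text_el.text or ""
    passage.remove(text_el)
    next_id = 0
    for s_off, sent in split_sentences(text):
        s_el = ET.SubElement(passage, "sentence")
        ET.SubElement(s_el, "offset").text = str(base + s_off)
        ET.SubElement(s_el, "text").text = sent
        for t_off, tok in tokenize(sent):
            ann = ET.SubElement(s_el, "annotation", id=f"T{next_id}")
            ET.SubElement(ann, "infon", key="type").text = "token"
            ET.SubElement(ann, "location",
                          offset=str(base + s_off + t_off), length=str(len(tok)))
            ET.SubElement(ann, "text").text = tok
            next_id += 1
    return passage


# Hypothetical one-passage collection, inlined so the sketch is self-contained.
SAMPLE = (
    "<collection><source>demo</source><date></date><key></key>"
    "<document><id>1</id><passage><offset>10</offset>"
    "<text>BioC is simple. It works!</text></passage></document></collection>"
)

collection = ET.fromstring(SAMPLE)
for passage in list(collection.iter("passage")):
    annotate_passage(passage)
print(ET.tostring(collection, encoding="unicode"))
```

In a real setting the input would be parsed from a BioC file and the result written back out, and each later stage (part-of-speech tagging, lemmatization, parsing) could attach further `infon` entries or annotations to the same elements, which is what lets independent tools be chained.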

Journal: Database	Publication Date: Jun 16, 2014
Citations: 9	License type: cc-by
