Abstract

BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated, along with other BioC programs, into any BioC-compliant text mining system. As an application, we converted the NCBI disease corpus to BioC format and ran the pipelines on it successfully to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net.

Database URL: http://bioc.sourceforge.net

Highlights

  • The BioCreative IV Interoperability track [1] addressed the goal of interoperability—a major barrier for wide-scale adoption of available text mining tools

  • The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets

  • We ran the BioC natural language processing (NLP) tools on the NCBI disease corpus and compared the outputs obtained from their native implementations with the outputs obtained using their BioC-compatible counterparts


Summary

Introduction

The BioCreative IV Interoperability track [1] addressed interoperability, the lack of which is a major barrier to wide-scale adoption of available text mining tools. It is straightforward to incorporate BioC code into existing programs to read data from BioC-formatted input files and write results to BioC-formatted output files. As part of this track, the community contributed BioC-formatted data sets and BioC-compliant tools for various useful biomedical natural language processing (NLP) tasks. Our contributions to the interoperability track of the BioCreative IV challenge are BioC text-preprocessing pipelines in C++ and Java. Text preprocessing is integral to virtually all NLP systems. It processes the original text into meaningful units that contain important linguistic features.
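To make the idea of "meaningful units" concrete, the following is a minimal Java sketch of one preprocessing step, tokenization, that records each token as a stand-off annotation (offset, length, text). It is an illustration only: the class `TokenSketch`, the type `Span` and the method `tokenize` are hypothetical names and are not part of the BioC library, and the sketch assumes that token offsets are counted from the start of the document, so the passage offset is added to each position. The whitespace splitting shown here is far simpler than the MedPost and Stanford tokenizers that the pipelines actually wrap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are NOT the BioC Java library API.
class TokenSketch {

    // A stand-off annotation: where a token sits in the document and what it covers.
    static final class Span {
        final int offset;   // character offset from the start of the document (assumption)
        final int length;   // token length in characters
        final String text;  // the covered text

        Span(int offset, int length, String text) {
            this.offset = offset;
            this.length = length;
            this.text = text;
        }
    }

    // Split a passage on whitespace and record each token's document-level offset.
    static List<Span> tokenize(String passageText, int passageOffset) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        int n = passageText.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(passageText.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume characters until the next whitespace or the end of the passage.
            while (i < n && !Character.isWhitespace(passageText.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Span(passageOffset + start, i - start,
                                   passageText.substring(start, i)));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Print one token per line as: offset, length, text.
        for (Span s : tokenize("BioC shares text and annotations.", 100)) {
            System.out.println(s.offset + "\t" + s.length + "\t" + s.text);
        }
    }
}
```

Keeping the annotations separate from the text, rather than rewriting the text itself, lets later stages such as part-of-speech tagging or lemmatization attach further information to the same spans without disturbing the original passage.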

