BioC: a minimalist approach to interoperability for biomedical text processing

D C Comeau, M Krallinger, Z Lu, P Ciccarese, M Torii, W J Wilbur, Y Peng, F Leitner, K Verspoor, A Valencia, T C Wiegers, C H Wu, R Islamaj Dogan, K B Cohen, F Rinaldi

doi:10.1093/database/bat064

Abstract

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

Database URL: http://bioc.sourceforge.net/

Highlights

With the proliferation of natural language text, text mining has emerged as an important research area
How ‘little’ can one do to obtain interoperability? We provide an extensible markup language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations
We minimize the investment needed by a developer to use our approach; we provide data classes to hold documents in memory and connector classes to read/write the XML documents into/out of the data classes
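The split between data classes and connector classes mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code distributed by the BioC project: the class names (`Collection`, `Document`, `Passage`), the function names (`write_collection`, `read_collection`) and the sample values are hypothetical, and the element names (`collection`, `source`, `date`, `key`, `document`, `id`, `passage`, `infon`, `offset`, `text`) are assumptions about the layout that should be checked against the DTD available from the project site.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


# Data classes: hold a collection of documents in memory.
# (Hypothetical names; a simplified subset of what a full format would need.)
@dataclass
class Passage:
    offset: int
    text: str
    infons: dict = field(default_factory=dict)  # free-form key/value metadata


@dataclass
class Document:
    id: str
    passages: list = field(default_factory=list)


@dataclass
class Collection:
    source: str
    date: str
    key: str
    documents: list = field(default_factory=list)


# Connector functions: move data between XML text and the data classes.
def write_collection(collection: Collection) -> str:
    """Serialize a Collection to an XML string (assumed element names)."""
    root = ET.Element("collection")
    ET.SubElement(root, "source").text = collection.source
    ET.SubElement(root, "date").text = collection.date
    ET.SubElement(root, "key").text = collection.key
    for doc in collection.documents:
        doc_el = ET.SubElement(root, "document")
        ET.SubElement(doc_el, "id").text = doc.id
        for passage in doc.passages:
            passage_el = ET.SubElement(doc_el, "passage")
            for name, value in passage.infons.items():
                ET.SubElement(passage_el, "infon", key=name).text = value
            ET.SubElement(passage_el, "offset").text = str(passage.offset)
            ET.SubElement(passage_el, "text").text = passage.text
    return ET.tostring(root, encoding="unicode")


def read_collection(xml_text: str) -> Collection:
    """Parse an XML string produced by write_collection back into data classes."""
    root = ET.fromstring(xml_text)
    collection = Collection(
        source=root.findtext("source", default=""),
        date=root.findtext("date", default=""),
        key=root.findtext("key", default=""),
    )
    for doc_el in root.findall("document"):
        doc = Document(id=doc_el.findtext("id", default=""))
        for passage_el in doc_el.findall("passage"):
            infons = {i.get("key"): i.text or "" for i in passage_el.findall("infon")}
            doc.passages.append(
                Passage(
                    offset=int(passage_el.findtext("offset", default="0")),
                    text=passage_el.findtext("text", default=""),
                    infons=infons,
                )
            )
        collection.documents.append(doc)
    return collection


# Made-up sample data for a round trip through XML.
sample = Collection(
    source="example",
    date="2013-09-18",
    key="example.key",
    documents=[
        Document(
            id="doc-1",
            passages=[Passage(offset=0, text="Text mining is useful.", infons={"type": "abstract"})],
        )
    ],
)
xml_text = write_collection(sample)
restored = read_collection(xml_text)
```

Keeping the in-memory classes free of XML details means a processing tool only has to deal with plain objects, while the reading and writing code is the single place that knows about the file layout.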

Summary

Introduction

With the proliferation of natural language text, text mining has emerged as an important research area. We describe in detail the BioC XML format and how it can be used to share text documents and to allow a large number of different annotations relevant for biomedical research to be represented.
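To make the idea of annotations on shared text more concrete, the sketch below represents two named entities and a relation between them as stand-off records and checks that each annotation's span matches the passage text. Everything here is an assumption for illustration: the class and field names, the convention that annotation offsets are counted from the start of the document, the `infons` keys and the example sentence are all made up and are not taken from the BioC distribution.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    id: str
    offset: int   # start position, assumed to be counted from the document start
    length: int   # number of characters covered
    text: str     # the annotated surface string
    infons: dict = field(default_factory=dict)


@dataclass
class Relation:
    id: str
    nodes: list   # (annotation id, role) pairs
    infons: dict = field(default_factory=dict)


def misaligned(passage_offset, passage_text, annotations):
    """Return ids of annotations whose span does not reproduce their text."""
    bad = []
    for ann in annotations:
        start = ann.offset - passage_offset
        if start < 0 or passage_text[start:start + ann.length] != ann.text:
            bad.append(ann.id)
    return bad


# Made-up passage that starts 100 characters into its document.
passage_offset = 100
passage_text = "Mutations in BRCA1 increase the risk of breast cancer."

gene = Annotation("A1", offset=113, length=5, text="BRCA1", infons={"type": "gene"})
disease = Annotation("A2", offset=140, length=13, text="breast cancer", infons={"type": "disease"})
wrong = Annotation("A3", offset=114, length=5, text="BRCA1", infons={"type": "gene"})

# A relation refers to annotations by id rather than copying their text.
link = Relation("R1", nodes=[("A1", "gene"), ("A2", "disease")], infons={"type": "association"})

problems = misaligned(passage_offset, passage_text, [gene, disease, wrong])
```

Because the annotations only point into the text, several tools can add their own layers (sentences, tokens, entities, relations) without altering the passage itself, and a check like `misaligned` can catch offset drift when data is exchanged.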

Journal: Database
Publication Date: Sep 18, 2013
Citations: 155
License type: cc-by

Similar Papers

From Natural Language Text to Visual Models: A survey of Issues and Approaches
Cristina-Claudia OSMAN ... Paula-Georgiana ZALHAN
Informatica Economica | VOL. 20
30 Dec 2016

Using Continuous Integration to organize and monitor the annotation process of domain specific corpora
Marc Schreiber ... Bodo Kraft
01 Apr 2014

Semantic Information Retrieval: A Comparative Experimental Study of NLP Tools and Language Resources for Arabic
Nadia Soudani ... Ibrahim Bounhas
01 Nov 2016

NCBI disease corpus: A resource for disease name recognition and concept normalization
Rezarta Islamaj Doğan ... Zhiyong Lu
Journal of Biomedical Informatics | VOL. 47
03 Jan 2014
