The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions.

Rezarta Islamaj Doğan, Kara Dolinski, Andrew Chatr-Aryamontri, Rose Oughtred, Sun Kim, Mike Tyers, Jennifer Rust, Donald C Comeau, W John Wilbur, Christie S Chang

doi:10.1093/database/baw147

Abstract

A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report.

Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html

Highlights

BioCreative (Critical Assessment of Information Extraction in Biology) [1,2,3,4] is a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state-of-the-art of text-mining systems applied to biologically relevant problems
The goal of the BioCreative challenges has been to pose tasks that will result in systems capable of scaling for use by general biology researchers and more specialized end users such as database curators
The task was positioned as a collaboration rather than a competition such that participating teams created complementary modules that could be seamlessly integrated into a system capable of assisting BioGRID curators

Summary

Introduction

BioCreative (Critical Assessment of Information Extraction in Biology) [1,2,3,4] is a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state-of-the-art of text-mining systems applied to biologically relevant problems. An important contribution of BioCreative challenges has been the generation of shared gold standard datasets, prepared by domain experts, for the training and testing of text-mining applications. These collections, and the associated evaluation methods, represent an important resource for continued development and improvement of text-mining applications. The interactive system built in the BioCreative V BioC task triaged sentences from full text articles in order to identify text passages associated with protein–protein and genetic interactions (abbreviated PPI and GI, respectively). These sentences were highlighted in the biocurator assistant viewer [13]. Participating teams world-wide each developed one or more modules independently [13,14,15,16,17,18]; the modules were integrated via BioC to ensure the interoperability of the different systems [11].



Journal: Database	Publication Date: Jan 1, 2017
Citations: 26	License type: cc-by


