Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases

Julien Gobeill, Dina Vishnyakova, Emilie Pasche, Patrick Ruch

doi:10.1093/database/bat041

Abstract

The available curated data lag behind current biological knowledge contained in the literature. Text mining can help biologists and curators locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and the GO terms themselves. Their effectiveness remains limited, however, owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing number of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performance have evolved. Systems are evaluated on their ability to propose, for a given abstract, the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms to GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, with Recall at 20 (R20) ranging from 0.28 to 0.31. In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, with R20 rising from 0.38 in 2006 to 0.56 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at assisting biologists.

Database URL: http://eagl.unige.ch/GOCat/
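To make the evaluation setting concrete, the following minimal Python sketch shows how a k-Nearest Neighbours proposal of GO terms and a Recall at k score could fit together. It is an illustration under stated assumptions, not the GOCat implementation: the abstract does not describe the similarity measure or the voting scheme, so the Jaccard token overlap, the similarity-weighted vote, the function names and the three-entry toy knowledge base are all hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens (an assumed, deliberately naive tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap; stands in for whatever similarity a real system uses."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propose_go_terms(abstract, knowledge_base, k=3):
    """Rank GO terms by summed similarity over the k most similar curated abstracts.

    knowledge_base is a list of (abstract_text, [go_term_ids]) pairs.
    """
    query = tokens(abstract)
    scored = sorted(
        ((jaccard(query, tokens(text)), go_terms) for text, go_terms in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, go_terms in scored[:k]:
        for term in go_terms:
            votes[term] += similarity
    return [term for term, _ in votes.most_common()]


def recall_at_k(ranked_terms, curated_terms, k=20):
    """Fraction of the curated GO terms found among the top k proposals."""
    expected = set(curated_terms)
    if not expected:
        return 0.0
    return len(set(ranked_terms[:k]) & expected) / len(expected)


# Toy knowledge base (hypothetical texts paired with GO identifiers).
toy_kb = [
    ("caspase activation triggers apoptosis in neurons", ["GO:0006915"]),
    ("dna repair after uv damage", ["GO:0006281"]),
    ("protein binding assay", ["GO:0005515"]),
]

ranked = propose_go_terms("apoptosis in neurons after caspase activation", toy_kb, k=2)
score = recall_at_k(ranked, ["GO:0006915", "GO:0005515"], k=20)
```

In this toy run, `ranked` is the list of proposed GO terms for the unseen abstract and `score` is its Recall at 20 against two curated terms. Averaging such scores over a benchmark of curated abstracts would give an aggregate figure of the kind reported as R20 above.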

Highlights

The available curated data lag behind current biological knowledge contained in the literature (1, 2).
We begin by briefly describing the resources used for the experiments: the Gene Ontology (GO), the GO Annotations (GOA) database, which provided both the knowledge base needed for the machine learning and the benchmarks needed for the evaluation, and the BioCreative I test set, which served as a supplementary benchmark for our evaluations.
We first present the evaluation of the current thesaurus-based (TB) classifier (EAGL) and machine learning (ML) classifier (GOCat), along with GoPubMed, for characterizing the functional profile of 50 abstracts published in 2012.

Summary

Introduction

The available curated data lag behind current biological knowledge contained in the literature (1, 2). A large amount of information is generated by research teams and is usually expressed in natural language published in scientific journals; this knowledge needs to be located, integrated and accessed by biologists and curators. In this perspective, text mining solutions could help biologists in keeping up with the literature (3–6). Characterizing the functional profile of a publication, whether it is for triage, for powering ontology-based search engines or integrated in a curation workflow, is one of these promising solutions.

Journal: Database
Publication Date: Jan 1, 2013
Citations: 27
License type: CC BY 3.0
