Abstract

Gene Ontology annotations help explain phenomena in the life sciences and provide strong support for biomedical research. The use of the Gene Ontology is gradually changing the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with text mining methods and existing resources, we recast the task as a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the imbalance between the numbers of positive and negative training samples. The method also enhances the discriminating ability of the classifiers by retaining and highlighting the key training samples. In addition, the top-down classifier, which is based on a tree structure, takes the relationships among target classes into account and thus resolves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value of 50.7% (precision: 52.7%; recall: 48.9%). The experimental results demonstrate that when the training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to any set of texts organized in an ontology structure or with a hierarchical relationship.
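The top-down idea in the abstract can be illustrated with a minimal sketch. It assumes a tree in which every node has its own binary classifier and a node is tested only if its parent was accepted, so the predicted labels can never contradict the hierarchy. The node names, the `predict_top_down` function, and the keyword-based stand-in classifiers are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

def predict_top_down(
    text: str,
    root: str,
    children_of: Dict[str, List[str]],
    accepts: Dict[str, Callable[[str], bool]],
) -> List[str]:
    """Return every node accepted on a path from the root (hypothetical helper).

    A node is only evaluated when its parent has been accepted, so the result
    is always consistent with the tree: no child appears without its parent.
    """
    predicted: List[str] = []
    frontier = list(children_of.get(root, []))  # the root itself is not a label
    while frontier:
        node = frontier.pop()
        if accepts[node](text):
            predicted.append(node)
            frontier.extend(children_of.get(node, []))
    return predicted

# Toy tree with made-up node names (assumption, for illustration only).
children_of = {
    "root": ["binding", "catalysis"],
    "binding": ["kinase_binding"],
    "catalysis": ["kinase_activity"],
}

# Keyword tests stand in for trained per-node classifiers (assumption).
accepts = {
    "binding": lambda t: "binding" in t,
    "kinase_binding": lambda t: "kinase" in t,
    "catalysis": lambda t: "catalytic" in t,
    "kinase_activity": lambda t: "kinase" in t,
}

labels = predict_top_down("kinase binding protein", "root", children_of, accepts)
```

In this toy run the `kinase_activity` node is never evaluated, even though its own keyword test would pass, because its parent `catalysis` was rejected. That pruning is what keeps the output compatible with the hierarchy.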

Highlights

  • One of the central purposes of genomics research is to explore the biological functions of the organism

  • It can be argued that these two types of relationship essentially convey certain hierarchies, so we consider that the expression of the current node can be enriched by incorporating the training samples of its ancestor node in the Gene Ontology (GO) structure, which may solve the problem of insufficient positive training samples

  • In determining the set of negative training samples, we focus primarily on the differences between parent and child nodes in the GO structure; in other words, samples that are associated with the parent nodes of the current node but are not inherited by the current node or its child nodes are selected as negative samples


Summary

Introduction

One of the central purposes of genomics research is to explore the biological functions of the organism. One prior study used ten different genomic data sources in Mus musculus, including protein domains, protein-protein interactions, gene expression, phenotype ontology, phylogenetic profiles, and disease data sources. In their experiment, the authors measured the contribution of each data set based on its prediction quality. The hierarchical classifiers, trained on multiple data types, are based on support vector machines (SVMs), and their predictions are combined in a Bayesian framework to obtain the most probable consistent set of predictions. Their method can implicitly calibrate the SVM margin outputs to probabilities. It can be argued that these two types of relationship essentially convey certain hierarchies, so we consider that the expression of the current node can be enriched by incorporating the training samples of its ancestor node in the GO structure, which may solve the problem of insufficient positive training samples.
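The negative-sample rule stated in the highlights (documents associated with the parent of the current node but not held by the current node or its child nodes) can be sketched as a set difference. This is a minimal illustration under stated assumptions: the hierarchy is treated as a tree with one parent per node, the names `select_samples`, `parent_of`, `children_of`, and `docs_of` are hypothetical, and the toy term and document identifiers are invented. How the ancestor-based enrichment of positive samples interacts with this rule is not specified in the text, so the sketch leaves it out and uses only the documents of the node and its children as positives.

```python
from typing import Dict, List, Optional, Set, Tuple

def select_samples(
    node: str,
    parent_of: Dict[str, Optional[str]],
    children_of: Dict[str, List[str]],
    docs_of: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (positives, negatives) for one node (hypothetical helper).

    positives: documents annotated to the node or to one of its children.
    negatives: documents of the parent that are not among the positives.
    """
    positives = set(docs_of.get(node, set()))
    for child in children_of.get(node, []):
        positives |= docs_of.get(child, set())

    parent = parent_of.get(node)
    parent_docs = docs_of.get(parent, set()) if parent is not None else set()
    negatives = parent_docs - positives
    return positives, negatives

# Toy hierarchy and annotations (invented identifiers, for illustration only).
parent_of = {"root": None, "A": "root", "B": "root", "A1": "A"}
children_of = {"root": ["A", "B"], "A": ["A1"], "B": [], "A1": []}
docs_of = {
    "root": {"d1", "d2", "d3", "d4"},
    "A": {"d1"},
    "A1": {"d2"},
    "B": {"d3"},
}

pos_a, neg_a = select_samples("A", parent_of, children_of, docs_of)
```

For node `A`, the positives are the documents of `A` and its child `A1`, and the negatives are the remaining documents of the parent `root`. Restricting negatives to the parent's documents keeps them close to the decision boundary between sibling classes, which is the motivation the highlights give for focusing on parent and child differences.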

Method Description
Experimental Results and Analysis
Conclusions