Abstract

As biological datasets accumulate rapidly, computational methods that automate protein function prediction are critically needed. Protein function prediction can be treated as a multi-label classification problem whose output is a set of functional annotations for each protein. Biologists, however, also want to discover the correlations between protein attributes and functions. We introduce a multi-label supervised topic model into protein function prediction and investigate the advantages of this approach. The topic model not only estimates the probability distribution of functions over protein instances effectively, but also directly provides the probability distribution of words over functions. To the best of our knowledge, this is the first effort to apply a multi-label supervised topic model to protein function prediction. In this paper, we model a protein as a document and a function label as a topic. First, a set of protein sequences is formalized into a bag of words. Then, we perform inference and estimate the model parameters to predict protein functions. Experimental results on yeast and human datasets demonstrate the effectiveness of this multi-label supervised topic model for protein function prediction, and show that it delivers superior results over the compared algorithms. In summary, the method discussed in this paper provides a new, efficient approach to protein function prediction and reveals more information about functions.
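
The "protein as a document" step can be illustrated with a minimal sketch. The abstract does not say how words are extracted from a sequence, so the overlapping k-mer tokenization below is an assumption, as are the function name `sequence_to_bow`, the default `k = 3`, the toy sequences, and the placeholder function labels `F1` to `F3`. It only shows the data shape a multi-label supervised topic model would consume: a word-count vector plus a set of function labels per protein.

```python
from collections import Counter


def sequence_to_bow(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in one protein sequence.

    Treating k-mers as "words" is an assumption made for illustration;
    the source does not specify its tokenization.
    """
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty bag of words.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy data: protein id -> (sequence, placeholder function labels).
proteins = {
    "P1": ("MKVMKV", {"F1", "F2"}),
    "P2": ("MKVLA", {"F2"}),
    "P3": ("LAGLAG", {"F3"}),
}

# Each protein becomes one "document": a bag of words with its label set.
corpus = {
    pid: (sequence_to_bow(seq), labels)
    for pid, (seq, labels) in proteins.items()
}

for pid, (bow, labels) in corpus.items():
    print(pid, dict(bow), sorted(labels))
```

In a label-supervised topic model, each function label would correspond to one topic, and a protein's words would be attributed only to the topics in its own label set during training.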

Highlights

  • Proteins are macromolecules and the main component of a cell; they are the most essential and versatile material of life

  • Gene ontology (GO) provides a set of terms for describing gene functions and the relationships between functions; the terms are classified into three categories: biological process (BP), molecular function (MF) and cellular component (CC) [4]

  • The ‘Materials and methods’ section describes the computational tools and datasets, with a focus on the bag of words (BoW) representation of protein sequences and the Labelled latent Dirichlet allocation (Labelled-LDA) model adopted in this paper; the ‘Results and discussion’ section presents the classification results and shows the usefulness of Labelled-LDA in two protein function prediction experiments, on a yeast dataset [30] and a human dataset, where Labelled-LDA delivers superior results over the compared algorithms; finally, the conclusions are drawn


Summary

Introduction

Proteins are macromolecules and the main component of a cell; they are the most essential and versatile material of life. Research on protein functions is of great significance for the development of new drugs, better crops, and even synthetic biochemicals [1]. Protein function annotation is inherently subjective. There are two popular schemes for protein function annotation: FunCat [2] and gene ontology (GO) [3]. Because it is grounded in solid computer science and biological principles, GO is rapidly becoming the most common scheme of functional annotation [1]. GO provides a set of terms for describing gene functions and the relationships between functions; the terms are classified into three categories: biological process (BP), molecular function (MF) and cellular component (CC) [4]. The words ‘function’ and ‘GO term’ are used synonymously.

