Abstract

BackgroundProtein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue.ResultsWe present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that have the same parent in the GO hierarchy.ConclusionsIn contrast to previous integration methods, our approaches not only increase the prediction quality but also gather information about highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.

Highlights

  • Protein function prediction has been one of the most important issues in functional genomics

  • In the case of kernel based L1-norm regularized logistic regression (KL1LR), when the precision at 20% recall (P20R) value was greater than or equal to 0.2, we considered features with regression coefficients located outside the 1 standard deviation (SD) as the highly contributing factors for protein function prediction

  • In KL1LR, it can be seen that the standardization of integrated data has improved the prediction quality and that the prediction accuracy of the AUC value is higher when the relative parameter λ = 0.01 than when λ = 0.1

Read more

Summary

Introduction

Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. Researchers have realized the limitations of knowing the genome sequence information if there is no functional information included For this reason, assigning functions to unknown proteins has become one of the most important topics in functional genomics; experimental methods have identified a number of protein functions with high accuracy, these approaches have required extensive resources in terms of both time and labour. Genomic data sets such as protein domain data, protein-protein interaction, gene expression data, phylogenetic profile, phenotype ontology, disease data, and protein complex data have been extensively used to predict protein functions

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.