MS-kNN: protein function prediction by integrating multiple data sources

Liang Lan, Yuhong Guo, Slobodan Vucetic, Nemanja Djuric

doi:10.1186/1471-2105-14-s3-s8


Open Access

https://doi.org/10.1186/1471-2105-14-s3-s8


Journal: BMC Bioinformatics	Publication Date: Feb 1, 2013
Citations: 89	License type: CC BY 2.0

Affiliation: Temple University

Abstract

Background: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds the k nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expression.

Results: We report the results in the context of the 2011 Critical Assessment of Function Annotation (CAFA). Prior to the CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN achieved a term-based Area Under the Curve (AUC) of 0.728 for Gene Ontology (GO) molecular function predictions when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. Similar results were observed for the prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that the overall MS-kNN accuracy was higher than that of the baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-kNN was rather small.

Conclusions: Our results yield several useful insights: (1) the k-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.
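The weighted-averaging idea described in the abstract can be illustrated with a minimal Python sketch. This is one plausible reading of "weighted averaging of its neighbors' functions", not the authors' exact formula: the function names, the division by k, the equal weighting of sources, and the skipping of sources that have no neighbors for the query are all assumptions made for illustration. Similarity scores are assumed to be scaled to [0, 1].

```python
from typing import Dict, List, Set


def knn_scores(sims: Dict[str, float],
               annotations: Dict[str, Set[str]],
               k: int) -> Dict[str, float]:
    """Score each function term from one data source.

    sims maps training-protein IDs to their similarity with the query.
    Each of the k most similar annotated proteins votes for its terms
    with a weight equal to its similarity; votes are divided by k.
    """
    annotated = [p for p in sims if p in annotations]
    neighbors = sorted(annotated, key=lambda p: sims[p], reverse=True)[:k]
    scores: Dict[str, float] = {}
    for protein in neighbors:
        for term in annotations[protein]:
            scores[term] = scores.get(term, 0.0) + sims[protein]
    return {term: s / k for term, s in scores.items()}


def ms_knn_scores(source_sims: List[Dict[str, float]],
                  annotations: Dict[str, Set[str]],
                  k: int) -> Dict[str, float]:
    """Average the per-source scores over the sources that cover the query.

    Sources with no similarity entries for the query are skipped
    (an assumption of this sketch).
    """
    used = [sims for sims in source_sims if sims]
    if not used:
        return {}
    totals: Dict[str, float] = {}
    for sims in used:
        for term, s in knn_scores(sims, annotations, k).items():
            totals[term] = totals.get(term, 0.0) + s
    return {term: s / len(used) for term, s in totals.items()}


# Toy example with hypothetical protein IDs and GO-like term labels.
toy_annotations = {"A": {"GO:1"}, "B": {"GO:1", "GO:2"}, "C": {"GO:3"}}
toy_sequence = {"A": 0.9, "B": 0.5, "C": 0.1}   # e.g. sequence similarity
toy_ppi = {"B": 1.0}                            # e.g. interaction evidence
toy_expression: Dict[str, float] = {}           # query absent from this source
toy_result = ms_knn_scores([toy_sequence, toy_ppi, toy_expression],
                           toy_annotations, k=2)
```

In this toy setting, a term supported by close neighbors in two sources receives a higher score than a term supported by a single weak neighbor, which is the behavior the abstract attributes to aggregating predictions across sources.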

Highlights

Protein function determination is a key challenge in the post-genomic era.
In an attempt to address some of the identified challenges and faced with the tight deadline of the 2011 Critical Assessment of Function Annotation (CAFA), we focused our attention on the k-nearest neighbor approach for function prediction proposed in [15].
By considering the results presented above, we observed that the lin-sim k-nearest neighbor (kNN) classifier improves prediction performance only slightly, while it is computationally costly and sensitive to the choice of the lin-sim threshold.

Summary

Introduction

Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. Sequence alignment-based function inference is the most widely used form of computational function prediction [1]. These approaches use sequence comparison tools, such as BLAST [2], to search annotated databases for the proteins most similar in sequence to the query protein and transfer their functions. Gotcha [3] is a similar method that takes sequence alignment scores between a query protein and a database of functionally annotated proteins, and overlays them on a functional ontology, cumulatively propagating the scores towards the root of the ontology. Both the BLAST and Gotcha approaches were used as baselines in the 2011 CAFA.
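The idea of overlaying alignment scores on an ontology and accumulating them towards the root can be sketched as follows. This is a rough illustration of cumulative score propagation only, not Gotcha's actual scoring scheme: the function names, the toy ontology, and the use of raw score sums without normalization are assumptions.

```python
from typing import Dict, Iterable, Set


def ancestors(term: str, parents: Dict[str, Iterable[str]]) -> Set[str]:
    """Return every ancestor of a term in an ontology given child -> parents links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_scores(hit_scores: Dict[str, float],
                     annotations: Dict[str, Set[str]],
                     parents: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Add each hit's alignment score to its annotated terms and all their ancestors.

    Each hit contributes at most once per term, so a term reachable through
    several of the hit's annotations is not double-counted.
    """
    totals: Dict[str, float] = {}
    for protein, score in hit_scores.items():
        covered: Set[str] = set()
        for term in annotations.get(protein, ()):
            covered.add(term)
            covered |= ancestors(term, parents)
        for term in covered:
            totals[term] = totals.get(term, 0.0) + score
    return totals


# Toy ontology: two leaf terms share a parent, which hangs off the root.
toy_parents = {"leaf1": ["mid"], "leaf2": ["mid"], "mid": ["root"]}
toy_hit_annotations = {"P1": {"leaf1"}, "P2": {"leaf2"}}
toy_hits = {"P1": 2.0, "P2": 3.0}  # hypothetical alignment scores
toy_totals = propagate_scores(toy_hits, toy_hit_annotations, toy_parents)
```

Because scores accumulate upwards, general terms near the root collect evidence from every hit beneath them, while specific terms are supported only by the hits annotated with them.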

Methods

Results

Conclusion