Abstract

Over the past decades, massive amounts of protein-protein interaction (PPI) data have accumulated thanks to advances in high-throughput technologies, but data quality issues (noise or incompleteness) in PPI data still limit the accuracy of protein function prediction based on PPI networks. Two main strategies, network reconstruction and edge enrichment, have each been reported in numerous studies to boost prediction performance, yet comparative studies of the performance differences between them are still lacking. Motivated by this gap, this study first uses three protein similarity metrics (local, global and sequence) for network reconstruction and edge enrichment in PPI networks, and then evaluates the performance differences among network reconstruction, edge enrichment and the original networks on two real PPI datasets. The experimental results demonstrate that edge enrichment works better than both network reconstruction and the original networks. Moreover, for edge enrichment of PPI networks, sequence similarity outperforms both local and global similarity. In summary, our study can help biologists select suitable pre-processing schemes and achieve better protein function prediction for PPI networks.
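
To make the distinction between the two pre-processing strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes that network reconstruction replaces the original edge set with protein pairs whose similarity passes a threshold, while edge enrichment keeps the original edges and adds those pairs. The Jaccard index over shared neighbours stands in for a "local" similarity metric; the function names, the threshold value and the toy network are all hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Map each protein to the set of its interaction partners."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard(adj, u, v):
    """Assumed local similarity: shared neighbours over all neighbours."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def similar_pairs(edges, threshold):
    """All protein pairs whose similarity reaches the (hypothetical) threshold."""
    adj = build_adjacency(edges)
    return {
        frozenset((u, v))
        for u, v in combinations(sorted(adj), 2)
        if jaccard(adj, u, v) >= threshold
    }


def reconstruct(edges, threshold):
    """Network reconstruction (assumed): keep only similarity-derived edges."""
    return similar_pairs(edges, threshold)


def enrich(edges, threshold):
    """Edge enrichment (assumed): original edges plus similarity-derived edges."""
    return {frozenset(e) for e in edges} | similar_pairs(edges, threshold)


# Toy PPI network, purely illustrative.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
reconstructed = reconstruct(toy_edges, threshold=0.5)
enriched = enrich(toy_edges, threshold=0.5)
```

On this toy network the reconstructed edge set contains only the similarity-derived pairs, whereas the enriched set is a superset of the original interactions. A global or sequence-based metric could be swapped in by replacing `jaccard` with another scoring function.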

Highlights

  • Over the past decades, massive amounts of un-annotated protein sequence data have been accumulated with the advancement of high-throughput biological technologies

  • The datasets in this study are based on Gene Ontology (GO) annotation

  • GO annotations consist of three basic namespaces: molecular function, biological process and cellular component

Summary

Introduction

Massive amounts of un-annotated protein sequence data have been accumulated with the advancement of high-throughput biological technologies. Because experimentally determining protein function annotations is costly and time-consuming, the proportion of annotated proteins remains relatively low (Sharan et al, 2007; Barrell et al, 2009). The emergence of available protein databases, such as FATCAT (Ye and Godzik, 2004), PAST (Täubig et al, 2006) and PROCAT (Wallace et al, 1996), has further helped to improve the effectiveness of protein function prediction. Low sequence similarity scores often occur when comparing target protein sequences with source protein sequences (Ofran et al, 2005), and this significantly limits the applicability of homology-based prediction methods.

