Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction.

Marco Mesiti, Matteo Re, Giorgio Valentini

doi:10.1186/2047-217x-3-5

Abstract

Background: Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence, their application to model organisms is often restricted to well-characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem is the construction of big networks that include multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the use of big computers could in principle address these issues, but this raises further algorithmic problems and requires resources that simple off-the-shelf computers cannot provide.

Results: We propose a novel framework for scalable network-based learning of multi-species protein functions, based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve the AFP problem “locally”, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally”, since we exploit the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 eukaryotic species. To our knowledge, this is the first work in which secondary memory-based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.

Conclusions: The combination of these algorithmic and technological approaches makes the analysis of large multi-species networks feasible on ordinary computers with limited speed and primary memory, and in the future could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt) using well-equipped stand-alone machines.
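To make the “vertex-centric” idea concrete, the following is a minimal sketch of a random walk in which each vertex computes its new probability using only its own neighborhood. It is an illustration under stated assumptions, not the authors' implementation: the names `Graph`, `vertex_update`, `random_walk` and the toy network are hypothetical, the graph is assumed undirected with positive weights, and everything is held in main memory, whereas a secondary memory-based system would load each neighborhood from disk.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical adjacency structure: vertex -> list of (neighbor, edge weight).
# The graph is assumed undirected, so every edge appears in both lists.
Graph = Dict[str, List[Tuple[str, float]]]


def vertex_update(v: str, graph: Graph, prob: Dict[str, float],
                  out_weight: Dict[str, float]) -> float:
    """New probability of v, computed only from v's direct neighbors."""
    return sum(w / out_weight[u] * prob[u] for u, w in graph[v])


def random_walk(graph: Graph, seeds: Set[str], steps: int) -> Dict[str, float]:
    """Run `steps` random walk steps starting uniformly from the seed vertices."""
    # Total edge weight leaving each vertex, used to normalize transitions.
    out_weight = {v: sum(w for _, w in nbrs) for v, nbrs in graph.items()}
    # Initial distribution: uniform over the seeds (e.g. annotated proteins).
    prob = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in graph}
    for _ in range(steps):
        # Each vertex is updated independently: the "local" solve.
        prob = {v: vertex_update(v, graph, prob, out_weight) for v in graph}
    return prob


# Toy path network a - b - c (assumed data, for illustration only).
toy: Graph = {
    "a": [("b", 1.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0)],
}
print(random_walk(toy, {"a"}, 2))  # {'a': 0.5, 'b': 0.0, 'c': 0.5}
```

Because `vertex_update` touches only one neighborhood at a time, the update never needs the whole network in primary memory, while repeated steps still propagate information across the overall topology.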

Highlights

Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations.
We compared the runtime required by main memory and secondary memory-based implementations (Section “Secondary memory-based computation”) of the random walk (RW) algorithm described in Section “Local implementation of network-based algorithms”.
Although our main aim is to show that the combination of local implementation and secondary memory-based computation allows the analysis of big biological networks on small computers, we also compared the performance achieved with single-species and multi-species networks of bacterial proteins, to experimentally assess the impact of a multi-species approach on the prediction of protein functions.

Summary

Introduction

Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence, their application to model organisms is often restricted to well-characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A recent example is the Integrative Multi-species Prediction (IMP) web server [4], which integrates prior knowledge and data collections from multiple organisms to generate novel functional working hypotheses for experimental follow-up. IMP covers only seven model organisms, which prevents its application to the prediction of protein functions in the proteomes of poorly annotated organisms.

Methods

Results

Conclusion