Fair evaluation of global network aligners.

Joseph Crawford, Tijana Milenković, Yihan Sun

doi:10.1186/s13015-015-0050-8

Abstract

Background: Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species. Then, function can be transferred from well- to poorly-annotated species between aligned network regions. Network alignment typically encompasses two algorithmic components: node cost function (NCF), which measures similarities between nodes in different networks, and alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use both different NCFs and different ASs. Thus, it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We already showed on state-of-the-art methods, MI-GRAAL and IsoRankN, that combining the NCF of one method with the AS of another can give a new superior method. Here, we evaluate MI-GRAAL against a newer approach, GHOST, by mixing and matching the methods’ NCFs and ASs to potentially further improve alignment quality. While doing so, we address important questions that have not been asked systematically thus far. First, we ask how much of the NCF information should come from protein sequence data compared to network topology data. Existing methods determine this parameter more or less arbitrarily, which could affect alignment quality. Second, when topological information is used in NCF, we ask how large the neighborhoods of the compared nodes should be. Existing methods assume that the larger the neighborhood size, the better.

Results: Our findings are as follows. MI-GRAAL’s NCF is superior to GHOST’s NCF, while the performance of the methods’ ASs is data-dependent. Thus, for data on which GHOST’s AS is superior to MI-GRAAL’s AS, the combination of MI-GRAAL’s NCF and GHOST’s AS represents a new superior method. Also, the amount of sequence information used within NCF does not affect alignment quality, while the inclusion of topological information is crucial for producing good alignments. Finally, larger neighborhood sizes are preferred, but often, it is the second largest size that is superior. Using this size instead of the largest one would decrease computational complexity.

Conclusion: Taken together, our results represent general recommendations for a fair evaluation of network alignment methods and in particular of two-stage NCF-AS approaches.

Electronic supplementary material: The online version of this article (doi:10.1186/s13015-015-0050-8) contains supplementary material, which is available to authorized users.

Highlights

Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species
(3) How large should the network neighborhoods of the compared nodes be within the node cost function (NCF) (“The size of nodes’ neighborhoods within NCF?”)? In addition, we comment on relationships between different alignment quality measures (“Relationships between different alignment quality measures”).
GHOST’s alignment strategy (AS) is superior to MI-GRAAL’s AS (Figure 4a, b). These findings are based on all alignments for all values of α, all neighborhood sizes, and all measures of alignment quality combined (“Aligners resulting from combining existing NCFs and ASs, and their parameters”), which might not be fair

Summary

Introduction

Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species. Network alignment typically encompasses two algorithmic components: node cost function (NCF), which measures similarities between nodes in different networks, and alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. We ask how much of the NCF information should come from protein sequence data compared to network topology data. Existing methods determine this parameter more or less arbitrarily, which could affect alignment quality. Methods for global network alignment (GNA) have been proposed, which aim to optimize global similarity between different networks and can find large conserved subgraphs [2, 3, 5, 6, 7, 9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. We focus on one-to-one GNA due to its recent popularity [2, 3, 31], but all concepts and ideas can be applied to one-to-many or many-to-many GNA, as well as to local network alignment (LNA).
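The two-stage NCF-AS idea and the role of the sequence-versus-topology parameter can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the linear blend of the two similarity matrices, the convention that α = 1 means topology only and α = 0 means sequence only, and the toy greedy strategy. It is not the NCF or AS of MI-GRAAL or GHOST; it only shows how an NCF and an AS can be treated as interchangeable components.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]      # sim[i][j]: similarity of node i (network 1) to node j (network 2)
Alignment = Dict[int, int]      # one-to-one mapping from network-1 nodes to network-2 nodes


def combine_similarities(topo: Matrix, seq: Matrix, alpha: float) -> Matrix:
    """Hypothetical NCF: alpha * topological + (1 - alpha) * sequence similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * t + (1.0 - alpha) * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(topo, seq)
    ]


def greedy_alignment(sim: Matrix) -> Alignment:
    """Toy AS (an assumption): repeatedly align the best-scoring unaligned pair."""
    pairs = sorted(
        ((score, i, j) for i, row in enumerate(sim) for j, score in enumerate(row)),
        key=lambda p: (-p[0], p[1], p[2]),  # highest score first, ties by index
    )
    used_i, used_j = set(), set()
    alignment: Alignment = {}
    for _score, i, j in pairs:
        if i not in used_i and j not in used_j:
            alignment[i] = j
            used_i.add(i)
            used_j.add(j)
    return alignment


def align(
    topo: Matrix,
    seq: Matrix,
    alpha: float,
    strategy: Callable[[Matrix], Alignment] = greedy_alignment,
) -> Alignment:
    """Mix and match: any NCF output can be passed to any alignment strategy."""
    return strategy(combine_similarities(topo, seq, alpha))


# Toy data in which topology and sequence disagree about the best pairing.
topo = [[0.9, 0.1], [0.2, 0.8]]
seq = [[0.0, 1.0], [1.0, 0.0]]

topology_only = align(topo, seq, alpha=1.0)
sequence_only = align(topo, seq, alpha=0.0)
```

Because the strategy is passed in as a parameter, swapping the AS while holding the NCF fixed (or vice versa) is a one-argument change, which is the kind of controlled comparison the evaluation calls for.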

Objectives

Methods

Results

Conclusion