Unrooted unordered homeomorphic subtree alignment of RNA trees

Nimrod Milo, Yefim Dinitz, Eitan Bachmat, Erez Katzenelson, Shay Zakov, Michal Ziv-Ukelson

doi:10.1186/1748-7188-8-13

Abstract

We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem (HSA), and present a new algorithm which applies to several modes, combining global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that for input trees T and S, our algorithm has an O(n_T n_S + min(d_T, d_S) L_T L_S) time complexity, where n_T, L_T and d_T are the number of nodes, the number of leaves, and the maximum node degree in T, respectively (satisfying d_T ≤ L_T ≤ n_T), and similarly for n_S, L_S and d_S with respect to the tree S. This improves the time complexity of previous algorithms for less general variants of the problem.

In order to obtain this time bound for HSA, we developed new algorithms for a generalized variant of the Min-Cost Bipartite Matching problem (MCM), as well as for two derivatives of this problem, entitled All-Cavity-MCM and All-Pairs-Cavity-MCM. For two input sets of size n and m, where n ≤ m, MCM and both its cavity derivatives are solved in O(n^3 + nm) time, without the use of priority queues (e.g. Fibonacci heaps) or other complex data structures. This gives the first cubic-time algorithm for All-Pairs-Cavity-MCM, and improves the running times of the MCM and All-Cavity-MCM problems in the unbalanced case where n ≪ m.

We implemented the algorithm (in all modes mentioned above) as a graphical software tool which computes and displays similarities between secondary structures of RNA given as input, and applied it in a preliminary experiment in which we ran all-against-all inter-family pairwise alignments of RNase P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches. Source code and a web interface for our tool can be found at http://www.cs.bgu.ac.il/~negevcb/FRUUT.

Highlights

Secondary structure of RNA molecules serves important functions in many non-coding RNAs [1].
Algorithm for homeomorphic subtree alignment: we describe a basic algorithm for Homeomorphic Subtree Alignment (HSA) for its unordered unrooted variant.
It can be asserted that removing or adding degree-2 nodes to a tree does not change its maximum degree or the number of its leaves, and trees with a high number of degree-2 nodes have a low maximum degree and a small number of leaves.
Algorithms for bipartite matching problems: we show efficient algorithms for the Min-Cost Bipartite Matching (MCM), All-Cavity-MCM, and All-Pairs-Cavity-MCM problems defined in Section ‘Min-Cost bipartite matching’.
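The highlight on degree-2 nodes can be checked on a small example. The following is a minimal sketch, not the authors' code: the adjacency-set representation and the helper names (build, smooth_degree2, count_leaves, max_degree) are assumptions made for illustration. It contracts every degree-2 node of an unrooted tree and confirms that the number of leaves and the maximum degree are unchanged. The example tree contains a branching node (degree 3); a bare path, whose maximum degree is itself 2, is the one case where contraction would lower the maximum degree.

```python
def build(edges):
    """Build an unrooted tree as a dict: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def smooth_degree2(adj):
    """Return a copy of the tree with every degree-2 node contracted."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    for v in list(adj):
        if len(adj[v]) == 2:
            a, b = adj[v]
            # Reconnect the two neighbours directly; their degrees stay the same.
            adj[a].discard(v)
            adj[b].discard(v)
            adj[a].add(b)
            adj[b].add(a)
            del adj[v]
    return adj


def count_leaves(adj):
    return sum(1 for nbrs in adj.values() if len(nbrs) == 1)


def max_degree(adj):
    return max(len(nbrs) for nbrs in adj.values())


# Hypothetical tree: one branching node "c" and three chains ending in leaves.
tree = build([
    ("c", "a1"), ("a1", "a2"), ("a2", "leaf1"),
    ("c", "leaf2"),
    ("c", "b1"), ("b1", "leaf3"),
])
smoothed = smooth_degree2(tree)

print(len(tree), count_leaves(tree), max_degree(tree))              # 7 3 3
print(len(smoothed), count_leaves(smoothed), max_degree(smoothed))  # 4 3 3
```

Only the node count drops (from 7 to 4), which is why a bound stated in terms of leaves and maximum degree can be much smaller than one stated in terms of the number of nodes for trees rich in degree-2 nodes.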

Summary

Background

Secondary structure of RNA molecules serves important functions in many non-coding RNAs [1].

Since it is possible to show that increasing the flow over a negative-cost augmentation path necessarily increases the size of the corresponding matching (otherwise it would imply a negative-cost cycle in the residual network), the number of iterations in the above-described Min-Cost Flow algorithm is |M∗| ≤ n (where M∗ is an optimal matching of minimum size), and so the total running time of the algorithm is O(|M∗| n^2 + nm), which is faster than O(n^3 + nm) when |M∗| is small.

Similarly to Kao et al. [28], we show that solutions for instances of the form (X \ {x}, Y \ {y}, w) correspond to certain shortest paths in the residual flow network obtained when solving the instance (X, Y, w). This observation allows solving both All-Cavity-MCM and All-Pairs-Cavity-MCM in the same time complexity, O(n^3 + nm), as that of the algorithm for MCM presented in the previous section.

This was done by multiplying the computed p-value by the number of tests performed (i.e. the number of tree pairs aligned within the family that participated in the corresponding test).
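The augmenting-path argument in the Background can be illustrated with a small sketch. This is not the paper's algorithm or its O(n^3 + nm) implementation: it is a plain successive-shortest-path scheme using Bellman-Ford, and it assumes that the generalized MCM asks for a matching of any size with minimum total weight, with negative weights allowed. The function names min_cost_matching and all_cavity_costs are hypothetical. Each iteration augments along a cheapest augmenting path and stops as soon as that path is no longer negative, so the number of iterations equals the size of the returned matching.

```python
import math


def min_cost_matching(w):
    """w[i][j] is the cost of pairing x_i with y_j (may be negative).

    Returns (total_cost, pairs) for a minimum-cost matching of any size.
    """
    n = len(w)
    m = len(w[0]) if w else 0
    match_x = [None] * n  # match_x[i] = j if x_i is matched to y_j
    match_y = [None] * m  # match_y[j] = i if y_j is matched to x_i
    while True:
        # Bellman-Ford over the residual graph, starting from all free x nodes.
        dist_x = [0 if match_x[i] is None else math.inf for i in range(n)]
        dist_y = [math.inf] * m
        parent_y = [None] * m  # x node preceding y_j on its shortest path
        for _ in range(n + m):
            changed = False
            for i in range(n):  # unmatched edges x -> y, cost +w
                if dist_x[i] == math.inf:
                    continue
                for j in range(m):
                    if match_x[i] != j and dist_x[i] + w[i][j] < dist_y[j]:
                        dist_y[j] = dist_x[i] + w[i][j]
                        parent_y[j] = i
                        changed = True
            for j in range(m):  # matched edges y -> x, cost -w
                i = match_y[j]
                if i is not None and dist_y[j] - w[i][j] < dist_x[i]:
                    dist_x[i] = dist_y[j] - w[i][j]
                    changed = True
            if not changed:
                break
        # Keep only free y nodes reachable by a negative-cost augmenting path.
        free = [j for j in range(m) if match_y[j] is None and dist_y[j] < 0]
        if not free:
            break
        j = min(free, key=lambda k: dist_y[k])
        while j is not None:  # flip edges along the path, back to a free x
            i = parent_y[j]
            prev_j = match_x[i]
            match_x[i], match_y[j] = j, i
            j = prev_j
    pairs = [(i, j) for i, j in enumerate(match_x) if j is not None]
    return sum(w[i][j] for i, j in pairs), pairs


def all_cavity_costs(w):
    """Naive baseline: re-solve from scratch with each x_i removed."""
    return [min_cost_matching(w[:i] + w[i + 1:])[0] for i in range(len(w))]


weights = [[-3, -2], [-4, 1]]
print(min_cost_matching(weights))  # (-6, [(0, 1), (1, 0)])
print(all_cavity_costs(weights))   # [-4, -3]
```

The baseline all_cavity_costs re-solves one instance per removed element, which is exactly the repeated work the text avoids by reading cavity solutions off shortest paths in the residual network of the single instance (X, Y, w).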

Results

Conclusions

25. Valiente G

29. Dinic E

39. Tarjan R

43. Lawler E: Combinatorial Optimization