Abstract
One of the most important issues in post-genomic molecular biology is the analysis of three-dimensional (3-D) protein structures, and searching 3-D structure databases is becoming increasingly important. The root mean square deviation (RMSD) is the most popular similarity measure for comparing two molecular structures. In this article, we propose new algorithms, fast both in theory and in practice, for the basic problem of finding all substructures of the structures in a database of chain molecules (such as proteins) whose RMSDs to the query are within a given constant threshold. The best-known worst-case time complexity for the problem is O(N log m), where N is the database size and m is the query size. The previous best-known expected time complexity for the problem is also O(N log m). We propose a new, breakthrough linear-expected-time algorithm. It is not only a theoretically significant improvement over previous algorithms but also faster in practice, according to computational experiments. Our experiments over the whole Protein Data Bank (PDB) show that our algorithm is 3.6 to 28 times faster than previously known algorithms when searching for similar substructures whose RMSDs are within 1 Å of queries of ordinary lengths. We also propose a series of preprocessing algorithms that enable faster queries, although no indexing algorithm was previously known whose query time complexity is better than the above O(N log m) bound. One is an O(N log² N)-time and O(N log N)-space preprocessing algorithm with expected query time complexity of O(m + N/√m). Another is an O(N log N)-time and O(N)-space preprocessing algorithm with expected query time complexity of O(N/√m + m log(N/m)).
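To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not the algorithm proposed in the article. It assumes that RMSD means the minimum over rigid motions (rotation plus translation) and computes it with the standard Kabsch singular-value formulation; the function names `kabsch_rmsd` and `naive_search` are hypothetical, and NumPy is assumed to be available. Each window costs O(m), so the whole scan costs O(Nm), which is the kind of cost the bounds above improve on.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """Minimum RMSD between two (m, 3) coordinate arrays over rigid motions.

    Illustrative helper (hypothetical name), using the Kabsch formulation.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove the translation by centering both point sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # 3x3 covariance matrix and its singular values (descending order).
    H = P.T @ Q
    S = np.linalg.svd(H, compute_uv=False)
    # Exclude reflections: flip the smallest singular value if det(H) < 0.
    d = np.sign(np.linalg.det(H))
    trace_term = S.sum() - (1.0 - d) * S[-1]
    sq_err = (P ** 2).sum() + (Q ** 2).sum() - 2.0 * trace_term
    # Clamp tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(sq_err, 0.0) / len(P)))


def naive_search(chain, query, threshold):
    """Return (start, rmsd) for every length-m window within the threshold.

    Naive O(N m) scan over one chain; a baseline only.
    """
    chain = np.asarray(chain, dtype=float)
    query = np.asarray(query, dtype=float)
    m = len(query)
    hits = []
    for start in range(len(chain) - m + 1):
        r = kabsch_rmsd(chain[start:start + m], query)
        if r <= threshold:
            hits.append((start, r))
    return hits
```

For a database of several chains, the same scan would be run on each chain separately, so that no window spans two molecules.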