Abstract

The sequence nearest neighbors problem can be defined as follows. Given a database D of n sequences, preprocess D so that, given any query sequence Q, one can quickly find a sequence S in D for which d(S, Q) ≤ d(T, Q) for any other sequence T in D. Here d(S, Q) denotes the “distance” between sequences S and Q, which can be defined as the minimum number of “edit operations” needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks).

One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangement (duplications, translocations and deletions) underlying genome evolution. Recent progress in resolving the composition of the human genome suggests that such segmental rearrangements are much more common than previously estimated. Thus there is a substantial need to incorporate similarity measures that capture block edit operations into genomic sequence comparison and search.
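To make the problem definition above concrete, the following is a minimal brute-force sketch in Python. It is only an illustrative baseline, not the data structure studied in the paper: the distance used here covers single character edits only (the classical Levenshtein distance), block edits are deliberately left out, and the function names `char_edit_distance` and `nearest_neighbor` are hypothetical.

```python
def char_edit_distance(s, t):
    """Minimum number of single character insertions, deletions and
    replacements turning s into t (block edits are NOT modelled here)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # replace a by b (or match)
        prev = curr
    return prev[-1]


def nearest_neighbor(database, query, dist=char_edit_distance):
    """Linear scan: return a sequence S with dist(S, query) minimal."""
    return min(database, key=lambda s: dist(s, query))


if __name__ == "__main__":
    db = ["ACGTACGT", "TTTTTTTT", "ACGTTCGT"]
    print(nearest_neighbor(db, "ACGTACGA"))
```

A scan like this evaluates the distance once per database sequence for every query, which is the cost that preprocessing D is meant to avoid.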
Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard. The first efficient data structure for approximate sequence nearest neighbor search for any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of D and query time near-linear in the size of Q by allowing an approximation factor of O(log l (log* l)^2). The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above.

In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement the transformations provided in [11] and test their accuracy in estimating the block edit distance on controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.

Keywords: Query Sequence, Edit Distance, Edit Operation, Neighbor Problem, Core Block
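As a rough illustration of the Hamming-space step mentioned in the abstract, the sketch below estimates Hamming distances from a random sample of coordinates and uses those estimates to pick an approximate nearest neighbor. This is a generic coordinate-sampling technique chosen for illustration; it is not claimed to be the estimator of [3] or the data structure of the paper, and the names `sample_positions`, `estimated_hamming` and `approx_nearest` are hypothetical.

```python
import random


def hamming(u, v):
    """Exact Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(u, v))


def sample_positions(length, k, seed=0):
    """Choose k distinct coordinates uniformly at random (assumed scheme)."""
    rng = random.Random(seed)
    return rng.sample(range(length), k)


def estimated_hamming(u, v, positions):
    """Scale the mismatch count on the sampled coordinates to full length."""
    mismatches = sum(u[i] != v[i] for i in positions)
    return mismatches * len(u) / len(positions)


def approx_nearest(database, query, positions):
    """Return the vector whose estimated distance to the query is smallest."""
    return min(database, key=lambda v: estimated_hamming(v, query, positions))


if __name__ == "__main__":
    db = ["00000000", "11111111"]
    positions = sample_positions(8, 4, seed=1)
    print(approx_nearest(db, "00000001", positions))
```

Comparing only the sampled coordinates makes each comparison cost proportional to the sample size rather than the full vector length, at the price of an estimate that can deviate from the exact distance.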
