Abstract

Most genomic data studies rely on sequence comparison and search, and comparison models based on alignment algorithms are the most commonly used. Alignment is very accurate, but it is practical only when the query is short, on the order of kilobytes, because it requires quadratic time and space, O(n^2), where n is the length of the target and query sequences. With the development of Next Generation Sequencing techniques, whole genome sequence data of megabyte size are being actively studied, and new comparison and search methods for large-scale sequence data are needed. We propose a new alignment-free sequence comparison and search method to overcome the limitations of the alignment-based model. In this graphical model, the sequence search problem in DNA strings is reduced to finding parts of a geometric object within a relatively small geometric space. When sequences of similar length are modified and their similarity is compared, the model accurately reflects the degree of similarity, which confirms that the comparison model is appropriate. When searching a 200MB whole genome sequence with the comparison model, using the compressed coordinate information, it was able to search 10MB sequences in 22s, a large reduction in time compared to alignment. Although the model cannot find exact positions at base-pair resolution as alignment does, it can be used as a preprocessing step to quickly search whole genome sequences of several hundred megabytes.

Highlights

  • Genomic data studies are done through sequence comparisons, mostly using a model based on an alignment algorithm

  • Instead of considering each base separately, as the alignment algorithm does, the method compares vectors generated from sequence units of a predetermined size, and compares them only once. This can significantly reduce the time required for the comparison operation, and visualizing the sequence search result presents the information more intuitively

  • For G, a genome sequence consisting of 4 DNA bases { a, g, t, c }, ranwalk(G) represents a three-dimensional geometric object constructed by our proposed algorithm
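To make the idea of a three-dimensional object built from a sequence concrete, here is a minimal sketch of a 3D walk over the four bases. The step vectors (the four vertices of a tetrahedron) and the function name ranwalk_sketch are assumptions chosen for illustration; this summary does not give the paper's actual vector allocation, so this should not be read as the authors' algorithm.

```python
# Hypothetical step vectors, one 3D direction per base.
# Assumption: tetrahedron vertices; the paper's real allocation is not shown here.
STEP = {
    "a": (1, 1, 1),
    "g": (1, -1, -1),
    "t": (-1, 1, -1),
    "c": (-1, -1, 1),
}

def ranwalk_sketch(genome):
    """Return the 3D points visited by the walk, starting at the origin."""
    x = y = z = 0
    points = [(x, y, z)]
    for base in genome.lower():
        dx, dy, dz = STEP[base]  # raises KeyError on symbols other than a, g, t, c
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points

if __name__ == "__main__":
    for point in ranwalk_sketch("agtc"):
        print(point)
```

With these assumed vectors, a sequence with equal counts of all four bases returns to the origin, so the shape of the path reflects local base composition along the sequence.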


Summary

Introduction

Genomic data studies are done through sequence comparisons, mostly using a model based on an alignment algorithm. Basic Local Alignment Search Tool (BLAST)[1] is the most common method to search for sequences in a database. It divides the query sequence into words of three characters, finds the matching regions, and gradually widens each region to select candidates for alignment. We propose a geometric-based heuristic technique that enables the rapid comparison and search of sequences on personal computers. In this regard, AMSS[3] is a model that provides shape-based similarity comparison, assuming that the time series data is a vector sequence. Instead of considering each base separately, as the alignment algorithm does, the method compares vectors generated from sequence units of a predetermined size, and compares them only once. This can significantly reduce the time required for the comparison operation, and visualizing the sequence search result presents the information more intuitively. We show the effectiveness of the proposed method with experiments on searching for short query sequences on a long sequence.
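The unit-based comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's method: the step vectors, the Euclidean distance, and the names unit_vectors, distance and search are all hypothetical, and the search only considers matches aligned to unit boundaries.

```python
import math

# Hypothetical step vectors (assumption: tetrahedron vertices, one per base).
STEP = {
    "a": (1, 1, 1),
    "g": (1, -1, -1),
    "t": (-1, 1, -1),
    "c": (-1, -1, 1),
}

def unit_vectors(seq, unit):
    """Sum the step vectors over each full block of `unit` bases."""
    vectors = []
    for start in range(0, len(seq) - unit + 1, unit):
        block = seq[start:start + unit].lower()
        vectors.append(tuple(sum(STEP[b][i] for b in block) for i in range(3)))
    return vectors

def distance(u, v):
    """Euclidean distance between two 3D vectors (an assumed similarity measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def search(target, query, unit):
    """Return (offset in units, total distance) of the best window, or None."""
    t = unit_vectors(target, unit)
    q = unit_vectors(query, unit)
    best = None
    for offset in range(len(t) - len(q) + 1):
        # One vector comparison per unit instead of one comparison per base.
        total = sum(distance(a, b) for a, b in zip(t[offset:], q))
        if best is None or total < best[1]:
            best = (offset, total)
    return best

if __name__ == "__main__":
    print(search("aaaa" + "gtca" + "cccc", "gtca", unit=4))
```

Because each unit is reduced to a single vector, the number of comparisons drops by a factor of the unit size, at the cost of losing base-level position information and of treating units with the same base composition as identical.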

Genome Sequence Visualization
Visualization Tool for Genome Sequence
Sequence Searching method with 3D Random Plot Structure
Vector Allocation for random Plot
Vector Extraction from Random Plot
Computing Similarity and Search on Random Plot
Reference Sequence Slot
Dataset Preparation
Experiment: Artificial Sequence Search over whole genome sequence
Experiment: Biological Sequence Search over whole genome sequence
Findings
Conclusion