Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment

Anurag Nagar, Michael Hahsler

doi:10.1186/1471-2105-14-s11-s2


Open Access

https://doi.org/10.1186/1471-2105-14-s11-s2


Journal: BMC Bioinformatics	Publication Date: Sep 1, 2013
Citations: 37	License type: CC BY 2.0

Affiliation: Southern Methodist University

Abstract

Background: Next Generation Sequencing techniques are producing enormous amounts of biological sequence data, and their analysis has become a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and which requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating the edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so-called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences.

Results: In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets, with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly.

Conclusion: Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions quickly or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.

Highlights

Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem
We find that p = 3, i.e. we count the occurrence of tri-mers within a segment, produces good results while creating Numerical Summarization Vectors (NSV) of length 4^3 = 64
We present an analysis of the species Leptotrichia buccalis that belongs to the phylum Fusobacteria and genus Leptotrichia
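
As a rough illustration of the tri-mer counting mentioned in the highlights, the following Python sketch builds one count vector of length 4^3 = 64 per segment of a sequence. The function names (nsv, segment_nsvs, manhattan), the segment length of 100 bases, the use of non-overlapping segments, and the choice of Manhattan distance are assumptions made for this example; they are not taken from the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, ambiguous bases are skipped


def nsv(segment, p=3):
    """Count overlapping p-mers in one segment; the result has 4**p entries."""
    kmers = ["".join(t) for t in product(ALPHABET, repeat=p)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(segment) - p + 1):
        pos = index.get(segment[i:i + p])
        if pos is not None:  # ignore p-mers containing symbols such as N
            counts[pos] += 1
    return counts


def segment_nsvs(sequence, segment_length=100, p=3):
    """Split a sequence into consecutive segments (hypothetical length 100)
    and summarize each segment as a count vector."""
    return [nsv(sequence[start:start + segment_length], p)
            for start in range(0, len(sequence), segment_length)]


def manhattan(a, b):
    """Sum of absolute count differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


a = nsv("ACGTACGT")
b = nsv("ACGTACGA")          # one substitution at the last base
print(len(a), manhattan(a, b))  # 64 2
```

A single substitution changes at most p overlapping p-mers, so it moves the count vector by at most 2p in Manhattan distance. This is the intuition behind using p-mer counts as an inexpensive stand-in for edit distance between segments.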

Summary

Results

We processed the entire Greengenes 16S rRNA database using the default settings for creating NSVs and GenModels, and analyzed the models for interesting patterns and clusters to search for highly similar or conserved regions across multiple sequences that may be related by taxonomy. To check these results, we performed MSA using Clustal [8] on the segments belonging to quasi-alignments 2 and 3. The search space for the best alignment can be reduced from the entire sequence length to just the strongly quasi-aligned segments. This can result in substantial savings in computational resources and time and produce results more efficiently. In the case of the species Leptotrichia buccalis, we discovered that the region between nucleotide base positions 100 and 300 contains highly similar sequences. Our algorithm can analyze the entire data set from Greengenes [22] using a simple personal computer. Performing such an analysis using traditional MSA would require extensive server resources and computing time. Quasi-alignment scales well for larger numbers of sequences and can provide accurate results quickly and efficiently.
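
The abstract describes grouping similar segment vectors with data stream clustering. The sketch below shows one generic way to do this in a single pass: each incoming vector joins the nearest existing cluster if it lies within a distance threshold, and otherwise opens a new cluster. It is not the authors' GenModel code; the function name cluster_stream, the Manhattan distance, the running-mean centroids, and the threshold value are all assumptions chosen for illustration.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_stream(vectors, threshold):
    """Single-pass threshold clustering (illustrative stand-in only).

    Returns a cluster label per input vector and the final centroids.
    """
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of vectors assigned to each cluster
    labels = []
    for v in vectors:
        # find the nearest existing centroid, if any
        best, best_d = None, None
        for c, centroid in enumerate(centroids):
            d = manhattan(v, centroid)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            # update the running mean of the matched cluster
            n = sizes[best]
            centroids[best] = [(m * n + x) / (n + 1)
                               for m, x in zip(centroids[best], v)]
            sizes[best] = n + 1
        else:
            # no cluster is close enough: start a new one
            centroids.append([float(x) for x in v])
            sizes.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels, centroids


# toy count vectors standing in for segment NSVs
toy = [[2, 0, 1], [2, 1, 1], [0, 5, 0], [0, 4, 1]]
labels, centroids = cluster_stream(toy, threshold=2)
print(labels)  # [0, 0, 1, 1]
```

Because every vector is read once and compared only against the current centroids, the work per vector does not depend on how many vectors came before it, as long as the number of clusters stays bounded. Segments that end up in the same cluster would correspond to a quasi-alignment in the sense used above.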

