Abstract

Background: Randomly shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence. In many cases, biologists need sophisticated shuffling tools that preserve not only the counts of distinct letters but also higher-order statistics such as doublet counts, triplet counts, and, in general, k-let counts.

Results: We present a sequence analysis tool (named uShuffle) for generating uniform random permutations of biological sequences (such as DNAs, RNAs, and proteins) that preserve the exact k-let counts. The uShuffle tool implements the latest variant of the Euler algorithm and uses Wilson's algorithm in the crucial step of arborescence generation. It is carefully engineered and extremely efficient. The uShuffle tool achieves maximum flexibility by allowing arbitrary alphabet size and let size. It can be used as a command-line program, a web application, or a utility library. Source code in C, Java, and C#, and integration instructions for Perl and Python are provided.

Conclusion: The uShuffle tool surpasses existing implementations of the Euler algorithm in both performance and flexibility. It is a useful tool for the bioinformatics community.
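To make the approach named in the abstract concrete, here is a minimal Python sketch of a k-let-preserving shuffle in the style of the Euler algorithm, with the arborescence drawn by Wilson's loop-erased random walks. It is an illustration written for this summary, not the uShuffle source code: the function name `shuffle_preserving_klets`, its parameters, and the handling of very short inputs are assumptions.

```python
import random
from collections import Counter

def shuffle_preserving_klets(seq, k, rng=random):
    """Return a random permutation of seq with the same k-let counts.

    Hypothetical helper for illustration; not part of uShuffle.
    """
    n = len(seq)
    if k <= 1:
        # Only letter counts must be preserved: a plain shuffle suffices.
        letters = list(seq)
        rng.shuffle(letters)
        return "".join(letters)
    if n <= k:
        # Assumption: with at most one k-let, return the input unchanged.
        return seq

    # Multigraph: vertices are (k-1)-lets, each k-let is an edge from its
    # prefix to its suffix. out[u] lists successors with multiplicity.
    out = {}
    for i in range(n - k + 1):
        u, v = seq[i:i + k - 1], seq[i + 1:i + k]
        out.setdefault(u, []).append(v)
        out.setdefault(v, [])
    root = seq[n - k + 1:]  # the walk must end at the last (k-1)-let

    # Wilson's algorithm: random walk from each vertex until the tree is
    # hit; overwriting nxt[u] on revisits erases loops implicitly.
    in_tree = {root}
    nxt = {}
    for start in out:
        u = start
        while u not in in_tree:
            nxt[u] = rng.choice(out[u])
            u = nxt[u]
        u = start
        while u not in in_tree:
            in_tree.add(u)
            u = nxt[u]

    # At every non-root vertex the arborescence edge is used last;
    # all other outgoing edges are taken in random order.
    for u, succ in out.items():
        if u != root:
            succ.remove(nxt[u])
            rng.shuffle(succ)
            succ.append(nxt[u])
        else:
            rng.shuffle(succ)

    # Follow the edge lists from the first (k-1)-let to spell the result.
    used = {u: 0 for u in out}
    u = seq[:k - 1]
    result = [u]
    for _ in range(n - k + 1):
        v = out[u][used[u]]
        used[u] += 1
        result.append(v[-1])
        u = v
    return "".join(result)

shuffled = shuffle_preserving_klets("ACGTACGATTGCAACGT", 2, random.Random(0))
```

Making the arborescence edge the last exit of each vertex is what guarantees the walk consumes every edge before stopping at the root, so the output always has the same length and the same k-let counts as the input.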

Highlights

  • Shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence

  • Altschul and Erickson [2] presented the first algorithm for generating truly uniform random sequences that preserve either the doublet counts or the triplet counts or both; a crucial step of their algorithm for generating random arborescences depends on a trial-and-error procedure, which is a potential bottleneck in performance

  • We have performed two sets of experiments to test the performance of two major forms of the uShuffle tool: we first benchmark the performance of the uShuffle C library, then compare the performance of the uShuffle Java applet with that of the shufflet program by Coward [11]


Summary

Introduction

Shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence. It is known that the stability of an RNA secondary structure depends crucially on the stacking of adjacent base pairs; the frequencies of distinct doublets in the random sequences are therefore important considerations in such analysis [4,25]. Biologists need sophisticated shuffling tools that preserve the counts of distinct letters and higher-order statistics such as doublet counts, triplet counts, and, in general, k-let counts.
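As a small illustration of what preserving k-let counts means, the following Python sketch counts overlapping k-lets and compares a sequence with two rearrangements of its letters. The helper `klet_counts` and the example sequences are assumptions made up for this illustration; they do not come from the paper.

```python
from collections import Counter

def klet_counts(seq, k):
    """Count every overlapping length-k substring (k-let) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

original = "ACGTACGA"
doublet_preserving = "ACGACGTA"  # same letters, same doublets
letters_only = "AACCGGTA"        # same letters, different doublets

same_letters = klet_counts(original, 1) == klet_counts(letters_only, 1)
same_doublets = klet_counts(original, 2) == klet_counts(doublet_preserving, 2)
same_triplets = klet_counts(original, 3) == klet_counts(doublet_preserving, 3)
naive_keeps_doublets = klet_counts(original, 2) == klet_counts(letters_only, 2)
```

Here `same_letters` and `same_doublets` are true, while `same_triplets` and `naive_keeps_doublets` are false: matching letter counts do not imply matching doublet counts, and matching doublet counts do not imply matching triplet counts.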
