Random generation of RNA secondary structures according to native distributions.

Markus E Nebel, Frank Weinberg, Anika Scheid

doi:10.1186/1748-7188-6-24

Open Access

https://doi.org/10.1186/1748-7188-6-24

Abstract

Background: Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random objects of a more complicated structure, like RNA molecules or proteins, are of interest.

Results: In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (one that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold), ours has two major advantages. Firstly, after a preprocessing step that computes all weighted class sizes needed, our approach computes a set of m random secondary structures of a given structure size n with a better worst-case time complexity than other algorithms typically achieve. Secondly, our approach works with integer arithmetic only, which is faster and spares us all the discomforting details of floating point arithmetic with logarithmized probabilities.

Conclusion: A number of experimental results show that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.
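
To make the idea of non-uniform generation by unranking with integer arithmetic more concrete, here is a minimal Python sketch. It is not the authors' algorithm or grammar: it assumes a toy grammar S -> empty | . S | ( S ) S over dot-bracket strings, with two hypothetical positive integer weights (`w_unpaired`, `w_pair`), and it ignores biological constraints such as minimum hairpin loop length. All function and parameter names are illustrative assumptions.

```python
import random


def weighted_sizes(n_max, w_unpaired, w_pair):
    """Precompute size[n], the total integer weight of all toy structures of length n."""
    size = [0] * (n_max + 1)
    size[0] = 1  # the empty structure has weight 1
    for n in range(1, n_max + 1):
        total = w_unpaired * size[n - 1]  # rule S -> . S
        for k in range(n - 1):  # rule S -> ( S ) S, inner part has length k
            total += w_pair * size[k] * size[n - 2 - k]
        size[n] = total
    return size


def unrank(n, rank, size, w_unpaired, w_pair):
    """Map an integer rank in [0, size[n]) to a dot-bracket string of length n."""
    if n == 0:
        return ""
    block = w_unpaired * size[n - 1]
    if rank < block:
        # The first `block` ranks belong to structures starting with an unpaired base.
        return "." + unrank(n - 1, rank % size[n - 1], size, w_unpaired, w_pair)
    rank -= block
    for k in range(n - 1):
        combos = size[k] * size[n - 2 - k]
        block = w_pair * combos
        if rank < block:
            rank %= combos  # discard which of the w_pair weight copies was hit
            left, right = divmod(rank, size[n - 2 - k])
            inner = unrank(k, left, size, w_unpaired, w_pair)
            rest = unrank(n - 2 - k, right, size, w_unpaired, w_pair)
            return "(" + inner + ")" + rest
        rank -= block
    raise ValueError("rank out of range")


def sample(n, size, w_unpaired, w_pair, rng=random):
    """Draw one structure with probability proportional to its weight."""
    return unrank(n, rng.randrange(size[n]), size, w_unpaired, w_pair)


if __name__ == "__main__":
    table = weighted_sizes(20, 2, 3)
    for _ in range(3):
        print(sample(20, table, 2, 3))
```

In this sketch a structure with u unpaired positions and p base pairs is hit by exactly `w_unpaired**u * w_pair**p` ranks, so drawing a uniform integer rank yields a weighted (non-uniform) distribution without any floating point operations. The table is computed once and reused for every sample, which mirrors the split between preprocessing and generation described in the abstract.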

Highlights

As stated in [4], random sequences are a topic of great interest in genome analysis, since according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate.
Note that according to [26], Boltzmann samplers may be employed for approximate-size as well as fixed-size random generation and are an alternative to standard combinatorial generators based on the recursive method.
Derivation of the algorithm: The elaborate stochastic context-free grammar (SCFG) G_sto is a suitable basis for the desired weighted unranking method. After determining the reweighting normal form (RNF) of this SCFG and the corresponding weighted combinatorial classes, we find a recursion for the size function.
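
As a rough illustration of what a recursion for the size function can look like, the generic recursions for a weighted disjoint union and a Cartesian product are sketched below. The notation is an assumption made for this sketch and is not taken from the paper: c_n, a_n and b_n denote the total weight of all objects of size n in the classes C, A and B, and the lambdas are integer rule weights.

```latex
% Assumed notation (not copied from the paper):
% c_n, a_n, b_n = total weight of all size-n objects in C, A, B;
% \lambda_A, \lambda_B = integer weights attached to the two alternatives.
\begin{align*}
  \mathcal{C} = \lambda_A \mathcal{A} + \lambda_B \mathcal{B}
    &\;\Longrightarrow\; c_n = \lambda_A\, a_n + \lambda_B\, b_n, \\
  \mathcal{C} = \mathcal{A} \times \mathcal{B}
    &\;\Longrightarrow\; c_n = \sum_{k=0}^{n} a_k\, b_{n-k}.
\end{align*}
```

Applying such rules to every production of a grammar in a suitable normal form gives a system of recursions from which all weighted class sizes up to a target size can be tabulated.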

Summary

Introduction

Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. According to [4], random biological sequences are, for instance, widely used for the detection of over-represented and under-represented motifs, as well as for determining whether scores of pairwise alignments are relevant or not. Although analytic approaches exist for these kinds of problems, for the most complex cases it is often still necessary to be able to use a corresponding experimental approach as an alternative (based on randomly generated sequences obtained from a computer program). For this purpose, random sequences must obviously obey a certain model that takes into account some relevant properties of actual real-life sequences, where such models are usually based on statistical parameters only.

Results

Discussion

Conclusion