This paper describes a genetic algorithm for predicting RNA structures that contain various types of pseudoknots. Pseudoknotted RNA structures are much more difficult to predict by computational methods than RNA secondary structures, as they are more complex and the analysis is time-consuming. We developed an efficient genetic algorithm to predict RNA folding structures containing any type of pseudoknot, as well as a novel initial population method to decrease computational complexity and increase the accuracy of the results. We also used an interaction filter to decrease the size of the possible stem lists for long RNA sequences. We predicted RNA structures using a number of different termination conditions and compared the validity of the results and the times required for the analyses. The algorithm proved able to efficiently predict RNA structures containing various types of pseudoknots.

Corresponding Author: Kyungsook Han (Email: khan@inha.ac.kr)

This work was supported by the Korea Science and Engineering Foundation (KOSEF) under grant R01-2003000-10461-0.

Introduction

Predicting an RNA structure with a pseudoknot by computational methods is computationally expensive. Predicting the most stable structure with minimal free energy from an RNA sequence is an optimization problem (Lee and Han, 2002; Lee and Han, 2003; Deiman and Pleij, 1997). Computational methods for predicting RNA structure generally make use of two kinds of algorithm, one combinatorial and the other recursive. The combinatorial algorithm first creates an inventory of all possible stem lists that can be formed by a given RNA sequence, and then determines the combination with the lowest free energy. This algorithm has the advantage that it can include pseudoknot structures, but the number of possible structures increases immensely with sequence length (Rivas and Eddy, 1999; Akutsu, 2000). The recursive algorithm finds the lowest free energy structure from the sub-fragments of a sequence.
It makes a systematic search of all sub-fragments for the lowest free energy structure containing at least one base pair. The first sub-fragments considered are those capable of forming a hairpin loop closed by a single base pair. Thus, in a first pass, it finds the lowest free energy structures for all pentanucleotides in the sequence. This method always finds the structure with the least free energy, but it does not identify structures such as pseudoknots because of their computational complexity.

A genetic algorithm (GA) is an optimization procedure that mimics the mechanism of biological evolution. It begins with a set of candidate solutions called a population. Solutions are then taken from it and used to form a new population, in the hope that the new population will be superior to the old one. They are selected to generate new solutions according to their fitness; the fitter they are, the more opportunities they have to reproduce. This procedure is repeated until some specified condition is satisfied. Genetic algorithms have been theoretically and empirically proven to provide robust searches in highly complex and uncertain spaces, and they are finding widespread application in commerce, science and engineering. They are computationally simple and powerful search methods, and many workers have used them to predict RNA structures and sequence alignments; they have been used to seek optimal and sub-optimal secondary RNA structures (Benedetti and Morosetti, 1995; Shapiro and Navetta, 1994) and to simulate RNA folding pathways (Gultyaev et al., 1995; Shapiro et al., 2001). Massively parallel genetic algorithms have been employed to predict RNA structures that include pseudoknots (Shapiro and Wu, 1996; Shapiro and Wu, 1997). However, the structures predicted contained only H (Hairpin)-type pseudoknots, and the computations were extremely complex because they used randomly generated initial populations.
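The combinatorial stem-list idea and the generic GA loop described above can be illustrated with a minimal sketch. It is not the algorithm developed in this paper: the function names (`find_stems`, `repair`, `evolve`), the parameter values, the truncation selection, and above all the fitness function (the number of base pairs, used as a crude stand-in for negative free energy) are assumptions made for illustration only. Stems are allowed to cross as long as they share no base, so pseudoknot-like combinations are not excluded.

```python
import random

# Watson-Crick and G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def find_stems(seq, min_len=3, min_loop=3):
    """List stems as (i, j, k): base i+t pairs with base j-t for t in 0..k-1."""
    n, stems = len(seq), []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            # Skip pairs that merely continue a stem starting further out.
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            k = 0
            while i + k < j - k - min_loop and (seq[i + k], seq[j - k]) in PAIRS:
                k += 1
            if k >= min_len:
                stems.append((i, j, k))
    return stems

def bases(stem):
    """Set of sequence positions occupied by a stem."""
    i, j, k = stem
    return set(range(i, i + k)) | set(range(j - k + 1, j + 1))

def repair(order, stems):
    """Keep stems in the given order, dropping any that reuse an occupied base."""
    used, chosen = set(), []
    for idx in order:
        b = bases(stems[idx])
        if not (b & used):
            used |= b
            chosen.append(idx)
    return chosen

def fitness(ind, stems):
    """Toy fitness (assumption): total base pairs, not a real free energy."""
    return sum(stems[idx][2] for idx in ind)

def evolve(seq, pop_size=30, generations=50, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    stems = find_stems(seq)
    if not stems:
        return [], stems
    all_idx = list(range(len(stems)))
    # Random initial population: each individual is a compatible stem subset.
    pop = [repair(rng.sample(all_idx, len(all_idx)), stems) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, stems), reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            order = list(dict.fromkeys(a + b))  # crossover: union of parent stems
            rng.shuffle(order)
            if rng.random() < mutation_rate:    # mutation: try a random stem first
                order.insert(0, rng.choice(all_idx))
            children.append(repair(order, stems))
        pop = parents + children
    best = max(pop, key=lambda ind: fitness(ind, stems))
    return [stems[idx] for idx in best], stems
```

Calling `evolve("GGGGAAAACCCC")` returns the chosen stems together with the full stem list. Here the loop stops after a fixed number of generations; other termination conditions could be substituted, and a realistic predictor would replace the toy fitness with a thermodynamic energy model.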
Dynamic programming algorithms have also been used to predict RNA structures that include pseudoknots (Rivas and Eddy, 1999), but they too could predict only structures with H-type pseudoknots, and only from short RNA sequences. We have developed a GA that is able to predict efficiently