Abstract

Background: High-quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit a much larger sequence heterogeneity compared to their encoded protein sequences due to the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes.

Results: The standard scoring scheme for nucleic acid alignments can be extended to incorporate simultaneously information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments that allows arbitrary weighting for almost all scoring parameters. Resource requirements of codaln are comparable with those of standard tools such as ClustalW.

Conclusion: We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and Vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments such as methods for detecting conserved RNA secondary structure elements.
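To make the idea of a combined nucleic acid plus amino acid score concrete, the following is a minimal sketch and not codaln's actual scoring function. It scores two in-frame codons by adding a per-nucleotide match/mismatch term to a weighted amino acid term. The function name `combined_score`, all parameter names, the default values, and the simple identity-based amino acid score are assumptions chosen for illustration; only the standard genetic code table is established fact.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def combined_score(codon_a, codon_b,
                   nt_match=1.0, nt_mismatch=-1.0,
                   aa_match=2.0, aa_mismatch=-1.0,
                   aa_weight=1.0):
    """Hypothetical combined score for two aligned, in-frame DNA codons.

    Assumes upper-case codons over the alphabet A, C, G, T.
    """
    # Nucleotide term: sum of per-position match/mismatch scores.
    nt_score = sum(
        nt_match if x == y else nt_mismatch
        for x, y in zip(codon_a, codon_b)
    )
    # Amino acid term: identity score on the translated codons
    # (a substitution matrix could be used here instead).
    same_aa = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    aa_score = aa_match if same_aa else aa_mismatch
    # The weight controls how strongly the translation influences the total.
    return nt_score + aa_weight * aa_score
```

In this sketch a synonymous difference such as CTT versus CTC (both leucine) scores higher than a non-synonymous one such as CTT versus CCT (leucine versus proline), although both pairs differ at exactly one nucleotide. Setting `aa_weight` to zero recovers a plain nucleotide score.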

Highlights

  • High-quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data

  • Multiple sequence alignments are a crucial prerequisite for a diverse set of methods ranging from the reconstruction of phylogenies and the quantification of adaptive evolution, to the detection of conserved RNA secondary structures and protein motifs

  • Examples of overlapping reading frames have been found in prokaryotic [17,18] and even in eukaryotic genomes [19,20]. In this contribution we describe a progressive alignment tool that implements an extended scoring scheme that simultaneously incorporates information on translation products in one or more (partly overlapping) reading frames, which allows the user to combine information from both the nucleic acid and the amino acid sequences


Summary

Results

More plausible alignments

Not surprisingly, we observe that codaln multiple alignments of coding DNA sequences have a much larger fraction of gaps with a length divisible by three than ClustalW multiple alignments. This is the desired effect of including amino acid-based scoring contributions, since it reduces biologically implausible frameshifts. While codaln produces a significantly higher fraction of gaps whose length is a multiple of three and correctly aligns the coding sequences in both exons, ClustalW treats only exon 2 correctly, which is highly conserved at the nucleic acid level.

At the 5'-terminal end of the Levivirus sequences we detect a short GC-rich hairpin (tetraloop) adjacent to an unpaired GGG element, see Fig. 6. This feature is probably the analogue of the recognition signal site for the RNA replicase in Alloleviviruses. The Qβ replicase amplifies RNA templates autocatalytically with high efficiency, and the recognition element, consisting of a hairpin and a short unpaired region at the 5'-terminus, is essential for recognition [36,37].
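The gap-length statistic used above to compare codaln and ClustalW alignments can be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `fraction_gaps_multiple_of_three` is hypothetical, the alignment is assumed to be given as a list of aligned rows with `-` as the gap character, and each maximal run of `-` within a row is counted as one gap.

```python
import re


def fraction_gaps_multiple_of_three(aligned_rows, gap_char="-"):
    """Fraction of gaps (maximal gap runs per row) with length divisible by 3.

    Returns 0.0 if the alignment contains no gaps at all.
    """
    # One or more consecutive gap characters form a single gap.
    gap_run = re.compile(re.escape(gap_char) + "+")
    lengths = [
        len(match.group(0))
        for row in aligned_rows
        for match in gap_run.finditer(row)
    ]
    if not lengths:
        return 0.0
    in_frame = sum(1 for n in lengths if n % 3 == 0)
    return in_frame / len(lengths)


# Toy alignment (invented for illustration): gaps of length 3, 1 and 6.
toy = [
    "ATG---CCCAAA",
    "AT-GGGCCCAAA",
    "ATGGGG------",
]
print(fraction_gaps_multiple_of_three(toy))
```

A higher value indicates that more gaps preserve the reading frame, which is the property the comparison in the text relies on.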
