Abstract

Direct analysis of unassembled genomic data could greatly increase the power of short-read DNA sequencing technologies and allow comparative genomics of organisms for which no completed reference is available. Here, we compare 174 chloroplasts by analyzing the taxonomic distribution of short kmers across genomes [1]. We then assemble de novo contigs centered on informative variation. The localized de novo contigs can be separated into two major classes: tip (unique to a single genome) and group (shared by a subset of genomes). Prior to assembly, we found that ∼18% of the chloroplast was duplicated in the inverted repeat (IR) region across a four-fold difference in genome sizes, from a highly reduced parasitic orchid [2] to a massive algal chloroplast [3], including gnetophytes [4] and cycads [5]. The conservation of this ratio between single-copy and duplicated sequence was basal among green plants, independent of photosynthesis and mechanism of genome size change, and different in gymnosperms and lower plants. Major lineages in the angiosperm clade differed in the pattern of shared kmers and de novo contigs. For example, parasitic plants demonstrated an expected accelerated overall rate of evolution, while the hemi-parasitic genomes contained a great deal more novel sequence than holo-parasitic plants, suggesting different mechanisms at different stages of genomic contraction. Additionally, the legumes are diverging more quickly and in different ways than other major families. Small duplicated fragments of the rrn23 genes were deeply conserved among seed plants, including among several species without the IR regions, indicating a crucial functional role of this duplication.
Localized de novo assembly of informative kmers greatly reduces the complexity of large comparative analyses by confining the analysis to a small partition of data and genomes relevant to the specific question, allowing direct analysis of next-gen sequence data from previously unstudied genomes and rapid discovery of informative candidate regions.
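
The tip/group distinction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the use of canonical (strand-independent) kmers, and the extra "core" label for kmers present in every genome are assumptions introduced here for illustration only.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (assumes upper-case A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> set:
    """Collect the canonical kmers of seq.

    A canonical kmer is the lexicographically smaller of a kmer and its
    reverse complement, so both strands map to the same key.
    Kmers containing characters other than A, C, G, T are skipped.
    """
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def classify_kmers(genomes: dict, k: int) -> dict:
    """Classify kmers by how many genomes contain them.

    Returns {"tip": {...}, "group": {...}, "core": {...}}, where each inner
    dict maps a kmer to the frozenset of genome names that contain it.
    "tip" and "group" follow the abstract's definitions; "core" (present in
    all genomes) is a hypothetical label added for this sketch.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in canonical_kmers(seq.upper(), k):
            owners[kmer].add(name)

    total = len(genomes)
    classes = {"tip": {}, "group": {}, "core": {}}
    for kmer, names in owners.items():
        if len(names) == 1:
            label = "tip"
        elif len(names) < total:
            label = "group"
        else:
            label = "core"
        classes[label][kmer] = frozenset(names)
    return classes
```

Calling `classify_kmers({"a": "AAAC", "b": "AAAG", "c": "CCCC"}, 3)` places the kmer `AAA` in the group class (shared by `a` and `b`) and the remaining kmers in the tip class. In a real analysis the genomes would be read from sequence files and k would be much larger; both are left out of this sketch.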

Highlights

  • Technological advances in genomic sequencing have made it possible to acquire vast amounts of DNA sequence data for any organism quickly and cheaply [6]

  • The simulated error rate (5%) increased the proportion of longer contigs among each set of localized de novo contigs

  • We examined whether sequencing error affected assembly accuracy by aligning a random subset of tip contigs against their reference and found that error had a very limited effect on accuracy (97% of 366 contigs without error and 96% of 1964 contigs with error were identical to their reference)

Introduction

Comparative genomics in the next-gen sequencing era

Technological advances in genomic sequencing have made it possible to acquire vast amounts of DNA sequence data for any organism quickly and cheaply [6]. For biologists working on non-model organisms without a reference genome, however, the de novo assembly of newly sequenced genomes and their comparative analysis are considerably more complicated and difficult. Accurate and full de novo assembly requires prodigious data coverage, the construction of numerous libraries, and extensive finishing of the genome assembly [8], all of which are frequently beyond the scope, budget, and requirements of ecological or evolutionary studies of non-model organisms. While partial assembly can provide informative markers [9], a large fraction of the available genomic data remains unanalyzed. Direct analysis of next-gen genomic sequence data could greatly simplify large comparative studies.
