Abstract
Background
Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in chloroplast genomes is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10–30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet long-reads tend to have higher error rates than short-reads, and relatively little is known about the best way to combine long- and short-reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, including the coverage of long-reads (Oxford Nanopore) and short-reads (Illumina), the length of the long-reads, and the assembly pipeline, with a view to determining the most accurate and efficient approach to chloroplast genome assembly.
Results
Hybrid assemblies combining at least 20x coverage of both long-reads and short-reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single copy, short single copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp long and contains 131 genes of known function.
Conclusions
Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using at least 20x coverage each of long-reads and short-reads, provided that the long-reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.
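The coverage criteria stated in the Conclusions can be checked with simple arithmetic: coverage is the total number of sequenced bases divided by the genome length. The sketch below is only an illustration under stated assumptions and is not part of the study's pipeline. The genome length is taken from the abstract, but the inverted repeat length (`IR_LENGTH`), the function names, and the toy read lengths are all hypothetical.

```python
# Genome length (bp) of the E. pauciflora chloroplast genome, from the abstract.
GENOME_SIZE = 159_942
# Hypothetical inverted repeat length (bp); the abstract only gives a 10-30 kb range.
IR_LENGTH = 26_000


def coverage(read_lengths, genome_size=GENOME_SIZE):
    """Mean depth of coverage: total sequenced bases / genome length."""
    return sum(read_lengths) / genome_size


def spanning_coverage(read_lengths, min_length=IR_LENGTH, genome_size=GENOME_SIZE):
    """Coverage contributed only by reads longer than min_length."""
    return sum(n for n in read_lengths if n > min_length) / genome_size


def meets_thresholds(long_reads, short_reads, min_cov=20.0, min_span_cov=5.0):
    """Check the thresholds from the Conclusions: at least 20x long-read and
    20x short-read coverage, with at least ~5x from reads longer than the
    inverted repeat."""
    return (
        coverage(long_reads) >= min_cov
        and coverage(short_reads) >= min_cov
        and spanning_coverage(long_reads) >= min_span_cov
    )


# Toy read-length lists (hypothetical, not data from the study).
long_reads = [30_000] * 30 + [8_000] * 300   # 3.3 Mb of long-read data
short_reads = [150] * 25_000                 # 3.75 Mb of short-read data

print(f"long-read coverage:  {coverage(long_reads):.1f}x")
print(f"spanning coverage:   {spanning_coverage(long_reads):.1f}x")
print(f"short-read coverage: {coverage(short_reads):.1f}x")
print("thresholds met:", meets_thresholds(long_reads, short_reads))
```

In practice the read lengths would come from the sequencing files, and the inverted repeat length from a reference or a preliminary assembly. Note that a long-read set can pass the 20x threshold and still fail the spanning criterion if none of its reads exceed the inverted repeat length.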
Highlights
Chloroplasts are organelles that conduct photosynthesis in plant and algal cells
By comparing a large range of different approaches to chloroplast genome assembly, we show that hybrid assembly with at least 20x coverage of long-reads and 20x coverage of short-reads is sufficient to assemble the entire chloroplast genome into a single contig with few or no errors
Given the low cost and simplicity of generating long-read data from the Oxford Nanopore Technologies (ONT) sequencer (MinION) [22, 34], the extremely low cost of producing high-coverage short-read data using Illumina technology, and the potential to multiplex multiple samples on both devices, this approach provides a clear path towards producing multiple highly accurate and complete chloroplast genome assemblies at very low cost
Summary
Chloroplasts are important organelles in algal and plant cells, which generate carbohydrates by photosynthesis [1]. The chloroplast genome provides important information for phylogenetics, population genetics and species identification [1,2,3,4,5,6,7,8], and is the focus of genetic engineering because it contains many genes involved in photosynthesis [1]. The structure of the chloroplast genome is highly conserved among plants, and usually consists of a long single copy region and a short single copy region (the long single copy region is longer and contains more genes than the short single copy region), separated by two identical inverted repeat regions. There are more than 1500 chloroplast genomes available in the NCBI organelle genome database.