Abstract

Background

Recent studies on genome assembly from short-read sequencing data reported that this technology cannot reconstruct the entire genome even at very high depth of coverage. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.

Methodology/Principal Findings

We define a metric H(k) as the entropy of sequencing reads at read length k, and we use the relative loss of entropy ΔH(k) to measure the impact of repeats on reconstructing a whole genome from sequences of length k. In our experiments, entropy loss correlated well with the de novo assembly coverage of a genome, and a score of ΔH(k)>1% indicated a severe loss in genome reconstruction fidelity. The minimal read length needed to achieve ΔH(k)<1% differs between organisms and is independent of genome size. For example, to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for sequencing the human genome (3.2×10^9 bp) and 320 bp for sequencing the fruit fly genome (1.8×10^8 bp). We also calculated ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the level of repeats varies widely between genomes and that the entropy of sequencing reads provides a measurement of their repeat structures.

Conclusions/Significance

The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation of the limitations of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful for tuning sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
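The abstract does not spell out the formulas for H(k) and ΔH(k), so the following is only a minimal sketch under stated assumptions: H(k) is taken to be the Shannon entropy of the empirical distribution of all length-k substrings of a single strand, the maximum entropy is log2 of the number of substrings (reached when every substring is unique), and ΔH(k) is the relative shortfall from that maximum. The function names (`kmer_entropy`, `relative_entropy_loss`, `minimal_read_length`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def kmer_entropy(genome: str, k: int) -> float:
    """Shannon entropy (bits) of the length-k substrings of `genome`.

    Assumption: every substring position is weighted equally and only
    the given strand is considered (no reverse complement).
    """
    n = len(genome) - k + 1  # number of length-k substrings
    if k < 1 or n < 1:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(genome[i:i + k] for i in range(n))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relative_entropy_loss(genome: str, k: int) -> float:
    """Assumed form of the relative entropy loss: (H_max - H(k)) / H_max."""
    n = len(genome) - k + 1
    if k < 1 or n < 1:
        raise ValueError("k must be between 1 and the sequence length")
    if n == 1:
        return 0.0  # a single substring carries no repeat information
    h_max = math.log2(n)  # entropy if all n substrings were distinct
    return (h_max - kmer_entropy(genome, k)) / h_max


def minimal_read_length(genome: str, threshold: float = 0.01):
    """Smallest k whose assumed entropy loss falls below `threshold`, else None."""
    for k in range(1, len(genome) + 1):
        if relative_entropy_loss(genome, k) < threshold:
            return k
    return None


if __name__ == "__main__":
    toy = "AAAAC"  # toy sequence, not real genomic data
    for k in range(1, len(toy) + 1):
        print(k, round(relative_entropy_loss(toy, k), 3))
    print("minimal read length:", minimal_read_length(toy))
```

Under these assumptions, a sequence with no repeated k-substrings has zero loss, while a sequence made of one repeated substring loses all of its entropy. Counting substrings in an in-memory `Counter` is practical only for small sequences; large genomes would need a more memory-efficient k-mer counter.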

Highlights

  • The development of next-generation sequencing (NGS) technologies raised the hope of conducting true haplotype analysis of the human genome [1] and of rapid full-genome sequencing and typing of various organisms

  • Most BAC sequences show only a small deficiency in reconstructed sequence coverage, with three exceptions: AC009243, AC092242 and AC007329. For these three sequences, coverage by SHARCGS-assembled contigs is significantly incomplete (less than 90%), and each of them loses more than 1% entropy at a read length of 30 bp

  • The evaluation of k-substring entropy shows that the genomes of different organisms may have distinct repeat structures that limit sequencing at a given read length, regardless of genome size

Summary

Introduction

The development of next-generation sequencing (NGS) technologies raised the hope of conducting true haplotype analysis of the human genome [1] and of rapid full-genome sequencing and typing of various organisms. In the first phase of the project, the goal was to generate low-coverage whole-genome shotgun sequencing of 185 individuals. These data were produced in order to validate millions of published genetic variations, including single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and other structural variants. Soon after the announcement of the project, another group of scientists started the Genome 10K project in 2009, which aims to "assemble a genomic zoo" by sequencing the genomes of vertebrate animals [3]. These studies would help us understand the correlation between genotypes and phenotypes if large-scale genome shotgun sequencing could be unambiguously and accurately assembled. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.
