Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly

Yen-Chun Chen, Tzen-Yuh Chiang, Chun-Hui Yu, Tsunglin Liu, Chi-Chuan Hwang, Ying Xu

doi:10.1371/journal.pone.0062856

Abstract

Next-generation-sequencing (NGS) has revolutionized the field of genome assembly because of its much higher data throughput and much lower cost compared with traditional Sanger sequencing. However, NGS poses new computational challenges to de novo genome assembly. Among these challenges, GC bias in NGS data is known to hamper genome assembly, but it is not clear to what extent GC bias affects genome assembly in general. In this work, we conduct a systematic analysis of the effects of GC bias on genome assembly. Our analyses reveal that GC bias lowers assembly completeness only when the degree of GC bias is above a threshold. At a strong GC bias, the assembly fragmentation due to GC bias can be explained by the low coverage of reads in the GC-poor or GC-rich regions of a genome. This effect is observed for all the assemblers under study. Increasing the total amount of NGS data thus rescues the assembly fragmentation caused by GC bias. However, the amount of data needed for a full rescue depends on the distribution of GC contents. Both low and high coverage depths due to GC bias lower the accuracy of assembly. These findings provide guidance toward a better de novo genome assembly in the presence of GC bias.

Highlights

Genome sequencing and assembly are essential for understanding the secrets behind genomes
We conducted a systematic analysis on the effects of GC bias on genome assembly
GC bias describes the relationship between GC content and read coverage across a genome

Summary

Introduction

Genome sequencing and assembly are essential for understanding the secrets behind genomes. On the Illumina system [9], a major NGS platform, it has been reported that extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to an uneven coverage or even no coverage of reads across the genome [9,10,11,12,13]. Illumina sequencing of a Plasmodium falciparum genome, which is extremely GC-poor with a mean GC content less than 25%, was found to favor the more GC-balanced regions, leading to few or no reads from the many GC-poor regions [13]. Most assemblers, except Velvet-SC [14], assume a uniform coverage of reads across genomes during assembly. High coverage regions may be treated as repetitive elements, leading to assembly fragmentations. Using the extremely GC biased Illumina data of P. falciparum mentioned above, an assembly was not even possible [13].
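To make the relationship between GC content and read coverage concrete, the following minimal Python sketch bins non-overlapping genome windows by GC fraction and reports the mean read depth per bin. It is an illustration only and not the procedure used in this study: the function names (`gc_fraction`, `coverage_by_gc`), the window size, the number of bins, and the per-base depth list given as input are all assumptions made for the example.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Return the GC fraction of seq, counting only A/C/G/T; None if no such bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def coverage_by_gc(genome, depth, window=100, n_bins=10):
    """Mean read depth per GC bin over non-overlapping windows.

    genome: sequence string; depth: per-base read depth, same length as genome.
    Both inputs, and the default window and bin settings, are hypothetical.
    A trailing partial window is ignored.
    """
    if len(genome) != len(depth):
        raise ValueError("genome and depth must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_fraction(genome[start:start + window])
        if gc is None:
            continue  # window contains only ambiguous bases such as N
        # A GC fraction of exactly 1.0 is placed in the last bin.
        gc_bin = min(int(gc * n_bins), n_bins - 1)
        sums[gc_bin] += sum(depth[start:start + window]) / window
        counts[gc_bin] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}


if __name__ == "__main__":
    # Toy data: a GC-poor, a GC-rich and a GC-balanced window with made-up depths.
    genome = "AT" * 50 + "GC" * 50 + "ATGC" * 25
    depth = [2] * 100 + [4] * 100 + [10] * 100
    for gc_bin, mean_depth in coverage_by_gc(genome, depth).items():
        print(f"GC bin {gc_bin}: mean depth {mean_depth:.1f}")
```

With unbiased data the mean depth would be roughly flat across bins. A profile that drops toward the GC-poor or GC-rich bins, as in the toy data above, is the kind of uneven coverage described in this section.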

Methods

Results

Discussion

Conclusion