PacBio But Not Illumina Technology Can Achieve Fast, Accurate and Complete Closure of the High GC, Complex Burkholderia pseudomallei Two-Chromosome Genome.

Jade L L Teng,Elaine Chan,Samson S Y Wong,Man Lung Yeung,Lilong Jia,Susanna K P Lau,Patrick C Y Woo,Herman Tse,Pak Chung Sham,Yi Huang,Chi Ho Lin

doi:10.3389/fmicb.2017.01448


Open Access



Abstract

Although PacBio third-generation sequencers have increased the read lengths of genome sequencing, which facilitates the assembly of complete genomes, no study has reported success in using PacBio data alone to completely sequence a two-chromosome bacterial genome from a single library in a single run. Previous studies using earlier versions of the sequencing chemistries have at most been able to finish bacterial genomes containing only one chromosome with de novo assembly. In this study, we compared the robustness of PacBio RS II, using one SMRT cell and the latest P6-C4 chemistry, with that of Illumina HiSeq 1500 in sequencing the genome of Burkholderia pseudomallei. This bacterium contains two large circular chromosomes, a very high G+C content of 68–69%, highly repetitive regions and substantial genomic diversity, and represents one of the largest and most complex bacterial genomes sequenced. The comparison used a reference genome generated by hybrid assembly of the PacBio and Illumina datasets with subsequent manual validation. Results showed that PacBio data with de novo assembly, but not Illumina data, was able to completely sequence the B. pseudomallei genome without any gaps or mis-assemblies. The two large contigs of the PacBio assembly aligned unambiguously to the reference genome, sharing >99.9% nucleotide identity. Conversely, Illumina data assembled using three different assemblers resulted in fragmented assemblies (201–366 contigs), sharing only 92.2–100% and 92.0–100% nucleotide identities to the chromosome I and II reference sequences, respectively, with no indication that the B. pseudomallei genome consisted of two chromosomes with four copies of ribosomal operons. Among all assemblies, the PacBio assembly recovered the highest number of core and virulence proteins, and of housekeeping genes based on whole-genome multilocus sequence typing (wgMLST). Most notably, assembly based solely on PacBio data outperformed even hybrid assembly using both PacBio and Illumina datasets. The hybrid approach generated only 74 contigs, while the PacBio data alone with de novo assembly achieved complete closure of the two-chromosome B. pseudomallei genome without additional costly bench work or further sequencing. PacBio RS II using P6-C4 chemistry is highly robust and cost-effective and should be the platform of choice for sequencing bacterial genomes, particularly those that are well known to be difficult to sequence.

Highlights

Since the release of the first complete bacterial genome sequence in 1995 (Fleischmann et al., 1995), genome sequencing has been the cornerstone of studying any bacterial species.
Sequence data generated from both PacBio and Illumina were used for de novo assembly using a hybrid approach in an attempt to generate a reference genome for subsequent comparison and analyses.
We investigated whether the assemblies from each platform could generate the correct MLST profile for this isolate, which was previously determined to be of sequence type (ST)-70 by conventional PCR and DNA sequencing using primers suggested by the MLST website for typing of B. pseudomallei.

Summary

Introduction

Since the release of the first complete bacterial genome sequence in 1995 (Fleischmann et al., 1995), genome sequencing has been the cornerstone of studying any bacterial species. In the 1990s and early 2000s, bacterial genome sequencing was performed by the random shotgun approach: physical shearing of the bacterial chromosomal DNA, cloning of the sheared fragments, sequencing of individual clones and assembly of the sequences using computer software. This approach, which relies on low-throughput long-read Sanger sequencing, is extremely labor intensive and expensive. The Illumina HiSeq platform utilizes sequencing-by-synthesis technology, in which fluorescently labeled reversible terminator nucleotides are incorporated into growing DNA strands and imaged via their fluorophore excitation at the point of incorporation. This method provides true base-by-base sequencing that virtually eliminates errors, and up to 750 Gb of data can be produced per sequencing run. Illumina platforms are limited by their read length, currently ranging from 25 to 300 bases, and because they require PCR amplification of multiple DNA templates before sequencing, there is potential for base-composition bias, which may skew the G+C content of the sequences (Goodwin et al., 2016).
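G+C content, the quantity that base-composition bias can distort, is simply the fraction of G and C bases among the unambiguous bases of a sequence. The following minimal Python sketch is an illustration only and is not part of the study's pipeline; the function names `gc_content` and `windowed_gc` and the toy sequences are assumptions introduced here.

```python
from collections import Counter


def gc_content(sequence: str) -> float:
    """Return the G+C fraction of a DNA sequence.

    Ambiguous bases (e.g. N) are ignored, so the denominator is the
    number of A, C, G and T characters only. Hypothetical helper.
    """
    counts = Counter(sequence.upper())
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total


def windowed_gc(sequence: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start, G+C fraction) for each full-length window.

    A profile like this can help show whether G+C content varies
    along a contig. Trailing partial windows are skipped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    profile = []
    for start in range(0, len(sequence) - window + 1, step):
        profile.append((start, gc_content(sequence[start:start + window])))
    return profile


if __name__ == "__main__":
    toy = "GGGGAAAA"  # toy sequence, not real B. pseudomallei data
    print(gc_content(toy))
    print(windowed_gc(toy, window=4, step=4))
```

Applied to a whole chromosome, `gc_content` would give a single genome-wide value comparable to the 68–69% reported for B. pseudomallei, while `windowed_gc` gives a local profile along the sequence.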

Methods

Results

Conclusion