A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes.

Holly B Bratcher, Craig Corton, Julian Parkhill, Keith A Jolley, Martin CJ Maiden

doi:10.1186/1471-2164-15-1138

Open Access

Journal: BMC Genomics	Publication Date: Dec 1, 2014
Citations: 226	License type: cc-by

Affiliation: Wellcome Sanger Institute, University of Oxford

Abstract

Background: Highly parallel, ‘second generation’ sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary.

Results: The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes, and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis, which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses, and the data are publicly available in the PubMLST Neisseria database.

Conclusions: The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease causing lineages.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-1138) contains supplementary material, which is available to authorized users.
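The core genome criterion stated in the Results (loci present in at least 95% of the population) can be illustrated with a minimal sketch. This is not the Genome Comparator implementation: the data layout (a dictionary of per-isolate allele designations), the function name `core_loci`, and the isolate and locus names are all hypothetical, chosen only to show how a presence threshold selects loci.

```python
from typing import Dict, List, Optional

# Hypothetical layout (an assumption, not the PubMLST/BIGSdb schema):
# isolate name -> {locus name: allele number, or None if the locus was not found}
Profiles = Dict[str, Dict[str, Optional[int]]]


def core_loci(profiles: Profiles, threshold: float = 0.95) -> List[str]:
    """Return loci with an allele designated in at least `threshold` of isolates."""
    if not profiles:
        return []
    n_isolates = len(profiles)
    # Collect every locus seen in any isolate.
    all_loci = sorted({locus for alleles in profiles.values() for locus in alleles})
    core = []
    for locus in all_loci:
        # A locus counts as present when the isolate has a non-missing allele for it.
        present = sum(
            1 for alleles in profiles.values() if alleles.get(locus) is not None
        )
        if present / n_isolates >= threshold:
            core.append(locus)
    return core


# Toy data: four isolates, three loci (all values invented for illustration).
example: Profiles = {
    "isolate_A": {"locus_1": 1, "locus_2": 4, "locus_3": None},
    "isolate_B": {"locus_1": 2, "locus_2": 4, "locus_3": 7},
    "isolate_C": {"locus_1": 1, "locus_2": None, "locus_3": None},
    "isolate_D": {"locus_1": 3, "locus_2": 5},
}

print(core_loci(example, threshold=0.75))
```

With only four isolates, the toy call uses a 75% threshold so that a locus missing from a single isolate still qualifies; at the 95% threshold used in the study, a locus would have to be present in every one of these four isolates.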

Highlights

Parallel, ‘second generation’ sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics
There are many questions in bacterial biology that can be adequately addressed with population genomic approaches that employ subsets of the genome [44], such as multilocus sequence typing (MLST) (Figure 1), rMLST (Figure 4) and cgMLST; for these analyses, NGS datasets provide a rich source of information [15]
To reflect the increased resolution of whole genome typing, we propose the use of a lineage nomenclature (Table 4) to distinguish groupings obtained by rMLST and cgMLST from the clonal complex association identified by MLST

Summary

Introduction

Parallel, ‘second generation’ sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. The widespread application of parallel high-throughput ‘next generation’ sequencing (NGS) technologies has made whole genome sequence (WGS) data available for tens of thousands of bacterial isolates [1]. Most of these data are publicly available only as depositions in short-read sequence archives: in December 2013 the European Bioinformatics Institute (EBI) Sequence Read Archive (SRA) contained more than 100,000 bacterial WGS records, over 90% of which comprised millions of short sequence reads, each fewer than 200 bases in length. These data represent a major resource for studies of bacterial diversity, evolution and function; however, because the throughput of genome finishing and annotation technologies has not kept pace with sequence determination, the genomes have to be reassembled before they can be interpreted. The Bacterial Isolate Genome Sequence Database (BIGSdb) platform provides this functionality for WGS data [17].

Methods

Results

Discussion

Conclusion