Construction of a virtual Mycobacterium tuberculosis consensus genome and its application to data from a next generation sequencer.

Kayo Okumura,Mitsunori Kayano,Teruo Kirikae,Tohru Miyoshi-Akiyama,Masako Kato

doi:10.1186/s12864-015-1368-9

Abstract

BackgroundAlthough Mycobacterium tuberculosis isolates are consisted of several different lineages and the epidemiology analyses are usually assessed relative to a particular reference genome, M. tuberculosis H37Rv, which might introduce some biased results. Those analyses are essentially based genome sequence information of M. tuberculosis and could be performed in sillico in theory, with whole genome sequence (WGS) data available in the databases and obtained by next generation sequencers (NGSs). As an approach to establish higher resolution methods for such analyses, whole genome sequences of the M. tuberculosis complexes (MTBCs) strains available on databases were aligned to construct virtual reference genome sequences called the consensus sequence (CS), and evaluated its feasibility in in sillico epidemiological analyses.ResultsThe consensus sequence (CS) was successfully constructed and utilized to perform phylogenetic analysis, evaluation of read mapping efficacy, which is crucial for detecting single nucleotide polymorphisms (SNPs), and various MTBC typing methods virtually including spoligotyping, VNTR, Long sequence polymorphism and Beijing typing. SNPs detected based on CS, in comparison with H37Rv, were utilized in concatemer-based phylogenetic analysis to determine their reliability relative to a phylogenetic tree based on whole genome alignment as the gold standard. Statistical comparison of phylogenic trees based on CS with that of H37Rv indicated the former showed always better results that that of later. SNP detection and concatenation with CS was advantageous because the frequency of crucial SNPs distinguishing among strain lineages was higher than those of H37Rv. The number of SNPs detected was lower with the consensus than with the H37Rv sequence, resulting in a significant reduction in computational time. Performance of each virtual typing was satisfactory and accorded with those published when those are available.ConclusionsThese results indicated that virtual CS constructed from genome sequence data is an ideal approach as a reference for MTBC studies.Electronic supplementary materialThe online version of this article (doi:10.1186/s12864-015-1368-9) contains supplementary material, which is available to authorized users.

Highlights

Mycobacterium tuberculosis isolates are consisted of several different lineages and the epidemiology analyses are usually assessed relative to a particular reference genome, M. tuberculosis H37Rv, which might introduce some biased results
Construction and features of a virtual M. tuberculosis consensus genome We constructed two types of M. tuberculosis consensus genome sequences, one consisting of 13 M. tuberculosis strains and the M. bovis BCG, M. africanum and M. canettii strains shown in Table 1, and the other consisting only of the 13 M. tuberculosis strains
From consensus sequence (CS) was always smaller than that based on whole genome alignment and single nucleotide polymorphisms (SNPs) concatemers derived from H37Rv (Table 3). These results indicated that SNP concatemers based on different reference sequences behave differently in phylogenetic analysis, emphasizing the critical importance of selecting the proper reference sequence, and CS is superior to H37Rv when it is used as the reference sequence in phylogenetic analysis of M. tuberculosis complexes (MTBCs)

Summary

Introduction

Mycobacterium tuberculosis isolates are consisted of several different lineages and the epidemiology analyses are usually assessed relative to a particular reference genome, M. tuberculosis H37Rv, which might introduce some biased results. A variety of molecular typing methods have been used to classify M. tuberculosis strains for epidemiological studies, including assessment of the presence of the IS6110 restriction fragment length polymorphism (RFLP) [5], spoligotyping, analysis of mycobacterial interspersed repetitive unit-variable number tandem repeats (MIRU-VNTR) [6] and large sequence polymorphisms (LSPs) [7,8] Their target sequences are mobile elements (e.g. IS6110), Okumura et al BMC Genomics (2015) 16:218 repetitive sequences (e.g. spoligotyping and MIRU-VNTR) and relatively long sequence polymorphisms (at least 7 bp), with many strains belonging to unrelated lineages possessing these DNA elements in common. This has led to the formulation of optimized sets, including a 15-locus system as the new standard for routine epidemiological discrimination and a 24-locus system as a high-resolution tool for phylogenetic studies [14] These 15- and 24-locus VNTR locus systems have been utilized for first-line typing, they are insufficient in distinguishing among closely related strains of the Beijing family to define deep phylogenetic structures [15]. That conventional molecular typing methods for MTBC are limited in distinguishing among strain subtypes

Methods

Results

Conclusion