Abstract

Pathogens of the Mycobacterium tuberculosis complex (MTBC) are considered to be monomorphic, with little gene content variation between strains. Nevertheless, several genotypic and phenotypic factors separate strains of the different MTBC lineages (L), especially L5 and L6 (traditionally termed Mycobacterium africanum) strains, from each other. However, this genome variability and gene content, especially of L5 strains, have not been fully explored and may be important for pathobiology and for current approaches to genomic analysis of MTBC strains, including transmission studies. By comparing the genomes of 355 L5 clinical strains (including 3 complete genomes and 352 Illumina whole-genome sequenced isolates) to each other and to H37Rv, we identified multiple genes that were differentially present or absent between H37Rv and L5 strains. Additionally, considerable gene content variability was found across L5 strains, including a split in the L5.3 sub-lineage into L5.3.1 and L5.3.2. These gene content differences had a small knock-on effect on transmission cluster estimation, with clustering rates influenced by the selected reference genome, and with potential overestimation of recent transmission when using H37Rv as the reference genome. We conclude that full capture of the gene diversity, especially for high-resolution outbreak analysis, requires a variation of the single H37Rv-centric reference genome mapping approach currently used in most whole-genome sequencing data analysis pipelines. Moreover, the high within-lineage gene content variability suggests that the pan-genome of M. tuberculosis is at least several kilobases larger than previously thought, implying that a concatenated or reference-free genome assembly (de novo) approach may be needed for particular questions.

Highlights

  • Tuberculosis (TB) is caused by pathogenic bacteria of the Mycobacterium tuberculosis complex (MTBC) that consists of strains of nine human-adapted lineages and several animal-adapted lineages [3,4,5,6]

  • We found that multiple genes are differentially present or absent between H37Rv and lineage 5 (L5) strains, and that there is high within-L5 gene content variability, resulting in the split of the sub-lineage L5.3 into L5.3.1 and L5.3.2

  • We found that the current H37Rv-centric approach widely used in MTBC epidemiology would overestimate the clustering rate of L5 strains since it misses single-nucleotide polymorphisms present in L5-only genes
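
The reasoning in the last highlight can be illustrated with a toy sketch. This is not the authors' pipeline: the isolate names, site labels, alleles, the 1-SNP threshold, and the definition of clustering rate (fraction of isolates linked to at least one other isolate) are all assumptions chosen for illustration. It shows only that dropping variable sites located in L5-only genes can shrink pairwise SNP distances and so inflate the apparent clustering rate.

```python
from itertools import combinations


def snp_distance(a, b, sites):
    """Count the sites at which two isolates carry different alleles."""
    return sum(1 for s in sites if a.get(s) != b.get(s))


def clustering_rate(isolates, sites, threshold):
    """Fraction of isolates in a single-linkage cluster of size >= 2.

    Two isolates are linked when their SNP distance over `sites`
    is at most `threshold` (an assumed, illustrative cutoff).
    """
    names = list(isolates)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for x, y in combinations(names, 2):
        if snp_distance(isolates[x], isolates[y], sites) <= threshold:
            parent[find(x)] = find(y)

    sizes = {}
    for n in names:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    clustered = sum(size for size in sizes.values() if size > 1)
    return clustered / len(names)


# Hypothetical alleles: "shared_*" sites lie in genes present in H37Rv,
# "l5only_*" sites lie in genes absent from H37Rv.
isolates = {
    "iso_A": {"shared_1": "A", "shared_2": "C", "l5only_1": "G", "l5only_2": "T"},
    "iso_B": {"shared_1": "A", "shared_2": "C", "l5only_1": "T", "l5only_2": "C"},
    "iso_C": {"shared_1": "G", "shared_2": "T", "l5only_1": "G", "l5only_2": "T"},
}
shared_sites = ["shared_1", "shared_2"]
all_sites = shared_sites + ["l5only_1", "l5only_2"]

# Mapping to H37Rv only "sees" the shared sites.
rate_h37rv = clustering_rate(isolates, shared_sites, threshold=1)
# An L5-aware reference also sees the L5-only sites.
rate_l5_aware = clustering_rate(isolates, all_sites, threshold=1)

print(f"H37Rv-only sites: {rate_h37rv:.2f}")
print(f"All sites:        {rate_l5_aware:.2f}")
```

In this toy data set, iso_A and iso_B differ only at the L5-only sites, so they appear identical when only H37Rv-mappable sites are compared and fall apart once the extra sites are included.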

Summary

Introduction

Tuberculosis (TB) is caused by pathogenic bacteria of the Mycobacterium tuberculosis complex (MTBC) that consists of strains of nine human-adapted lineages and several animal-adapted lineages [3,4,5,6]. This group is highly clonal with no detected horizontal gene transfer [7, 8]. Strains of particular lineages are primarily defined by large sequence polymorphisms (LSPs, the presence or deletion of genomic regions) such as the TbD1 region (MTBC-specific deletion 1) [9], other regions of difference (RDs) [10, 7, 11,12,13] and signature single-nucleotide polymorphisms (SNPs) [14]. The reasons for the geographical restriction of L5 and L6 remain unclear, although adaptation to particular human subpopulations has been suggested [10, 22,23,24].
