Abstract

Bioinformatic tools are currently being developed to better understand the Mycobacterium tuberculosis complex (MTBC). Several approaches already exist for identifying MTBC lineages using classical genotyping methods such as mycobacterial interspersed repetitive units-variable number of tandem DNA repeats (MIRU-VNTR) and spoligotyping-based families. In the recently released SITVIT2 proprietary database of the Institut Pasteur de la Guadeloupe, a large number of spoligotype families were assigned either by manual curation/expertise or by an in-house algorithm. In this study, we present two complementary data-driven approaches allowing fast and precise family prediction from spoligotyping patterns. The first is based on data transformation and the use of decision tree classifiers. In contrast, the second searches for a set of simple rules using binary masks through a specifically designed evolutionary algorithm. A comparison with the three main approaches in the field highlighted the good performance of our contributions and a significant runtime gain. Finally, we propose the ‘SpolLineages’ software tool (https://github.com/dcouvin/SpolLineages), which implements these approaches for the identification of MTBC spoligotype families.
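To make the idea of a binary-mask rule concrete, here is a minimal sketch in Python. It assumes a spoligotype is written as a 43-character string of '1' (spacer present) and '0' (spacer absent). The function names and the example rule are hypothetical illustrations: the rule is loosely inspired by the commonly described Beijing signature (spacers 1 to 34 absent) and is not a rule produced by the evolutionary algorithm or taken from SpolLineages.

```python
# Illustrative sketch only; names and the example rule are assumptions,
# not the rules learned by the evolutionary algorithm described above.
N_SPACERS = 43  # a classical spoligotype covers 43 spacers


def to_bits(pattern: str) -> int:
    """Convert a 43-character '0'/'1' string into an integer bit field."""
    if len(pattern) != N_SPACERS or set(pattern) - {"0", "1"}:
        raise ValueError("expected a 43-character string of 0s and 1s")
    return int(pattern, 2)


def matches(pattern: str, mask: str, expected: str) -> bool:
    """Return True if the spacers selected by `mask` equal `expected`.

    Positions where `mask` is '0' are ignored; `expected` should be '0'
    at those positions.
    """
    return (to_bits(pattern) & to_bits(mask)) == to_bits(expected)


# Hypothetical rule: spacers 1 to 34 must all be absent, spacers 35 to 43 are ignored.
MASK = "1" * 34 + "0" * 9
EXPECTED = "0" * 43

beijing_like = "0" * 34 + "1" * 9   # spacers 35 to 43 present only
all_present = "1" * 43              # every spacer present

print(matches(beijing_like, MASK, EXPECTED))
print(matches(all_present, MASK, EXPECTED))
```

Encoding each rule as a mask plus an expected value keeps the test to a single bitwise AND and one comparison, which is one reason rule-based prediction of this kind can be fast.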

Highlights

  • Tuberculosis (TB) is an infectious disease caused by bacteria belonging to the Mycobacterium tuberculosis complex (MTBC), with a broad host range

  • In the SITVIT2 [6] proprietary database of the Institut Pasteur de la Guadeloupe, which is an update of previously released SpolDB/SITVIT databases [7, 8], Lineage 1 is known as EAI; Lineage 2 is known as Beijing; Lineage 3 is known as Central Asian (CAS); Lineage 4 includes Cameroon, Haarlem (H), Latin-American-Mediterranean (LAM), NEW-1, S, T, Turkey, Ural and X; Lineage 5 is known as AFRI 2 and AFRI 3; Lineage 6 is known as AFRI 1 and Lineage 7 is known as Ethiopian

  • We propose novel algorithmic approaches allowing quick and precise prediction of MTBC genotypic families from spoligotyping data, using a decision tree (DT), an evolutionary algorithm (EA) or classical binary rules
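The correspondence between SITVIT2 family names and lineages listed in the highlights above can be written as a simple lookup table. The sketch below is only an illustration in Python: the dictionary and function names are hypothetical, and only the top-level family labels quoted above are covered (sublineage labels are not handled).

```python
# Family-to-lineage correspondence as quoted in the highlights above.
# The names FAMILY_TO_LINEAGE and lineage_of are hypothetical.
FAMILY_TO_LINEAGE = {
    "EAI": "Lineage 1",
    "Beijing": "Lineage 2",
    "CAS": "Lineage 3",
    "Cameroon": "Lineage 4",
    "H": "Lineage 4",
    "LAM": "Lineage 4",
    "NEW-1": "Lineage 4",
    "S": "Lineage 4",
    "T": "Lineage 4",
    "Turkey": "Lineage 4",
    "Ural": "Lineage 4",
    "X": "Lineage 4",
    "AFRI 2": "Lineage 5",
    "AFRI 3": "Lineage 5",
    "AFRI 1": "Lineage 6",
    "Ethiopian": "Lineage 7",
}


def lineage_of(family: str) -> str:
    """Return the lineage for a family label, or 'unknown' if not listed."""
    return FAMILY_TO_LINEAGE.get(family, "unknown")


print(lineage_of("LAM"))
print(lineage_of("AFRI 1"))
```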



Introduction

Tuberculosis (TB) is an infectious disease caused by bacteria belonging to the Mycobacterium tuberculosis complex (MTBC), with a broad host range. MTBC includes a group of closely related species: Mycobacterium tuberculosis sensu stricto, Mycobacterium africanum, Mycobacterium bovis, Mycobacterium caprae, Mycobacterium pinnipedii, Mycobacterium suricattae, Mycobacterium orygis, Mycobacterium microti, Mycobacterium mungi and probably other ecotypes yet to be determined. Phylogenomic analysis of this group of organisms based on next-generation sequencing, digital DNA–DNA hybridization and average nucleotide identity showed that they might be considered heterotypic synonyms of M. tuberculosis [1]. Seven major TB lineages have been identified: Lineage 1 (Indo-Oceanic), Lineage 2 (East-Asian), Lineage 3 [East-African-Indian (EAI)], Lineage 4 (Euro-American), Lineage 5 (West-Africa 1), Lineage 6 (West-Africa 2) and Lineage 7 (Ethiopian or Aethiops vetus lineage). These lineages are known to cause TB in humans throughout the world; some of them (such as Lineage 3) are relatively specific to certain regions, whereas others (such as Lineage 4) are more globally distributed [5]. Two newly discovered lineages (Lineage 8 and Lineage 9), seemingly restricted to Africa, were recently described [9, 10].

