A machine learning-based typing scheme refinement for Listeria monocytogenes core genome multilocus sequence typing with high discriminatory power for common source outbreak tracking.

Yen-Yi Liu, Chih-Chieh Chen, Yung-Fu Chang

doi:10.1371/journal.pone.0260293

Abstract

Background: As whole-genome sequencing for pathogen genomes becomes increasingly popular, the typing methods of gene-by-gene comparison, such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST), are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, genomic sequences with varying read depths, read lengths, and assemblers influence the genome assemblies, introducing error or missing alleles into the generated allelic profiles. These errors and missing alleles might create “specious discrepancy” among closely related isolates, thus making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause problems related to storage and maintenance as well as long query search times.

Methods: We attempted to resolve these issues by decreasing the scheme size to reduce the occurrence of error and missing alleles, alleviate the storage burden, and improve the query search time. The challenge in this approach is maintaining the typing resolution when using fewer loci. We achieved this by using a popular artificial intelligence technique, XGBoost, coupled with Shapley additive explanations for feature selection. Finally, 370 loci from the original 1701 cgMLST loci of Listeria monocytogenes were selected.

Results: Although the size of the final scheme (LmScheme_370) was approximately 80% lower than that of the original cgMLST scheme, its discriminatory power, tested for 35 outbreaks, was concordant with that of the original cgMLST scheme. Although we used L. monocytogenes as a demonstration in this study, the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene–based epidemiology.

Highlights

The size of the final scheme (LmScheme_370) was approximately 80% lower than that of the original core genome MLST (cgMLST) scheme, yet its discriminatory power, tested for 35 outbreaks, was concordant with that of the original cgMLST scheme.
With the increasing use of next-generation sequencing (NGS) to investigate pathogen genomes, gene-by-gene (GBG) approaches, including multilocus sequence typing (MLST) [1], whole-genome MLST [2], and core genome MLST [3], have become more frequently applied in genomic epidemiology [4]. cgMLST is the mainstream NGS-based typing method, and it has been successfully applied in the detection of outbreak clusters [5, 6].
Several cgMLST schemes exist for L. monocytogenes, such as those of Ruppitsch et al. [10] and Moura et al. [9]. We selected the scheme published by Ruppitsch et al. [10] because the allelic sequences are downloadable.

Summary

Introduction

With the increasing use of next-generation sequencing (NGS) to investigate pathogen genomes, gene-by-gene (GBG) approaches, including multilocus sequence typing (MLST) [1], whole-genome MLST (wgMLST) [2], and core genome MLST (cgMLST) [3], have become more frequently applied in genomic epidemiology [4]. cgMLST is the mainstream NGS-based typing method, and it has been successfully applied in the detection of outbreak clusters [5, 6]. However, genomic sequences with varying read depths, read lengths, and assemblers influence the genome assemblies, introducing erroneous or missing alleles into the generated allelic profiles. These errors and missing alleles might create “specious discrepancy” among closely related isolates, making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause problems related to storage and maintenance as well as long query search times. We attempted to resolve the problems of specious discrepancy, storage burden, and query search time by reducing the scheme size, because a scheme with fewer loci offers fewer opportunities for erroneous or missing allele calls. The challenge of this approach is to retain the discriminatory power needed to distinguish outbreaks with a greatly reduced typing scheme size.
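To make the idea of specious discrepancy concrete, the following minimal Python sketch compares two toy allelic profiles, first over all loci and then over a reduced subset. It is an illustration under stated assumptions, not the authors' pipeline: the profiles, the function names `allelic_distance` and `restrict`, and the choice of retained loci are all hypothetical. In the study itself, the retained loci were selected with XGBoost and Shapley additive explanations, which this sketch does not attempt to reproduce.

```python
from typing import List, Optional, Sequence

# One allele number per locus; None marks a missing allele call.
Profile = Sequence[Optional[int]]


def allelic_distance(a: Profile, b: Profile,
                     missing_as_difference: bool = False) -> int:
    """Count the loci at which two allelic profiles differ.

    Loci where either call is missing are skipped by default; set
    missing_as_difference=True to count them as mismatches instead.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    distance = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            if missing_as_difference:
                distance += 1
            continue
        if x != y:
            distance += 1
    return distance


def restrict(profile: Profile, kept_loci: Sequence[int]) -> List[Optional[int]]:
    """Project a profile onto a reduced scheme given by locus indices."""
    return [profile[i] for i in kept_loci]


# Hypothetical toy data: two isolates assumed to come from the same source.
isolate_a = [1, 4, 2, 7, 3, 5]
isolate_b = [1, 4, None, 7, 9, 5]  # locus 2 missing, locus 4 assumed miscalled

# Hypothetical reduced scheme that happens to exclude the two problem loci.
kept_loci = [0, 1, 3, 5]

full_distance = allelic_distance(isolate_a, isolate_b)
full_distance_strict = allelic_distance(isolate_a, isolate_b,
                                        missing_as_difference=True)
reduced_distance = allelic_distance(restrict(isolate_a, kept_loci),
                                    restrict(isolate_b, kept_loci))
```

In this toy case the full profiles differ at one locus (two if the missing call is counted as a difference), whereas the reduced profiles are identical. A real reduced scheme removes such artifacts only to the extent that error-prone loci are among those dropped, and it must still separate isolates that are genuinely unrelated.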

Methods

Results

Discussion

Conclusion