Genome-Enabled Molecular Subtyping and Serotyping for Shiga Toxin-Producing Escherichia coli

Sung B Im,Mani Jain,Lavanya Rishishwar,Aroon T Chande,Heather A Carleton,I King Jordan,Sonali Gupta

doi:10.3389/fsufs.2021.752873

Abstract

Foodborne pathogens are a major public health burden in the United States, leading to 9.4 million illnesses annually. Since 1996, a national laboratory-based surveillance program, PulseNet, has used molecular subtyping and serotyping methods with the aim of reducing the burden of foodborne illness through early detection of emerging outbreaks. PulseNet-affiliated laboratories have used pulsed-field gel electrophoresis (PFGE) and immunoassays to subtype and serotype bacterial isolates. Widespread use of serotyping and PFGE for foodborne illness surveillance over the years has resulted in the accumulation of a wealth of routine surveillance and outbreak epidemiological data. This valuable source of data has been used to understand seasonal frequency, geographic distribution, demographic information, exposure information, disease severity, and source of foodborne isolates. In 2019, PulseNet adopted whole genome sequencing (WGS) at a national scale to replace PFGE with higher-resolution methods such as core genome multilocus sequence typing. Consequently, PulseNet's recent shift to genome-based subtyping methods has rendered the vast collection of historic surveillance data associated with serogroups and PFGE patterns potentially unusable. The goal of this study was to develop a bioinformatics method to associate the WGS data that are currently used by PulseNet for bacterial pathogen subtyping with previously characterized serogroups and PFGE patterns. Previous efforts to associate WGS data with PFGE patterns relied on predicting DNA molecular weight based on restriction site analysis. However, these approaches failed owing to the non-uniform usage of genomic restriction sites by PFGE restriction enzymes. We developed a machine learning approach to assign isolates to their most probable serogroup and PFGE pattern, based on comparisons of genomic k-mer signatures.
We applied our WGS classification method to 5,970 Shiga toxin-producing Escherichia coli (STEC) isolates collected as part of PulseNet's routine foodborne surveillance activities between 2003 and 2018. Our machine learning classifier associates STEC WGS data with higher-level serogroups with very high accuracy and with lower-level PFGE patterns with somewhat lower accuracy. Taken together, these classifications allow public health investigators to associate currently generated WGS data with historical epidemiological knowledge linked to serogroups and PFGE patterns, in support of outbreak surveillance for food safety and public health.
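
To make the idea of comparing genomic k-mer signatures concrete, the following is a minimal sketch only. It is not the classifier developed in this study: the choice of k, the cosine similarity measure, the nearest-reference rule, and the toy sequences and labels are all assumptions introduced here for illustration, and the function names are hypothetical.

```python
from collections import Counter
import math


def kmer_profile(seq, k=4):
    """Count overlapping k-mers in a DNA sequence, skipping ambiguous bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query_seq, references, k=4):
    """Return the label of the reference whose k-mer profile is most similar.

    `references` is a list of (label, sequence) pairs. This nearest-reference
    rule is an illustrative stand-in, not the method used in the study.
    """
    query = kmer_profile(query_seq, k)
    best_label, best_score = None, -1.0
    for label, ref_seq in references:
        score = cosine_similarity(query, kmer_profile(ref_seq, k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy data: made-up sequences with made-up serogroup labels (assumptions).
toy_references = [
    ("O157", "ATGCGTACGTTAGCATGCGTACGTTAGC"),
    ("O26", "GGGCCCAAATTTGGGCCCAAATTTGGGC"),
]
toy_label, toy_score = classify("ATGCGTACGTTAGCATGCGTACG", toy_references)
```

In practice, a signature would be computed from a whole genome assembly or read set with a much larger k, and a trained model would replace the nearest-reference rule; the sketch only shows how sequence composition can be turned into a comparable numeric profile.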

Highlights

In the United States, foodborne pathogens are responsible for an estimated 9.4 million illnesses each year, leading to over 55,000 hospitalizations and ∼1,300 deaths (Scallan et al, 2011).
Isolates were selected from the PulseNet Shiga toxin-producing Escherichia coli (STEC) National Database (Swaminathan et al, 2001) based on the availability of their complete O-antigen serogroup and pulsed-field gel electrophoresis (PFGE) test results, and whole genome sequence data submitted to NCBI’s Sequence Read Archive (SRA) (Supplementary Table 1).
We evaluated 5,970 STEC isolates collected in the US as part of the routine surveillance efforts of the PulseNet network of public health laboratories since 1996.

Summary

Introduction

In the United States, foodborne pathogens are responsible for an estimated 9.4 million illnesses each year, leading to over 55,000 hospitalizations and ∼1,300 deaths (Scallan et al, 2011). Widespread use of serotyping and PFGE for foodborne illness surveillance over many years has resulted in the accumulation of routine surveillance data from nearly 1 million isolates. These data are associated with distinct serogroups and PFGE patterns and include seasonal frequency, geographic distribution, demographic information, exposure information, disease severity, and common outbreak sources and vehicles (Tolar et al, 2019). This collection of outbreak data gives investigators links that help them determine the contamination sources of similar outbreaks more rapidly (Tolar et al, 2019).

Objectives

Methods

Results