Microbial source tracking leverages a wide range of approaches designed to trace the origins of fecal contamination in aquatic environments. Although source tracking methods are typically employed within the laboratory setting, computational techniques can be leveraged to advance microbial source tracking methodology. Herein, we present a logic regression-based supervised learning approach for the discovery of source-informative genetic markers within intergenic regions across the Escherichia coli genome that can be used for source tracking. With just single intergenic loci, logic regression was able to identify highly source-specific (i.e., exceeding 97.00%) biomarkers for a wide range of host and niche sources, with sensitivities reaching as high as 30.00%-50.00% for certain source categories, including pig, sheep, mouse, and wastewater, depending on the specific intergenic locus analyzed. Restricting the source range to reflect the most prominent zoonotic sources of E. coli transmission (i.e., bovine, chicken, human, and pig) allowed for the generation of informative biomarkers for all host categories, with specificities of at least 90.00% and sensitivities between 12.50% and 70.00%, using the sequence data from key intergenic regions, including emrKY-evgAS, ibsB-(mdtABCD-baeSR), ompC-rcsDB, and yedS-yedR, that appear to be involved in antibiotic resistance. Remarkably, we were able to use this approach to classify 48 out of 113 river water E. coli isolates collected in Northwestern Sweden as either beaver, human, or reindeer in origin with a high degree of consensus-thus highlighting the potential of logic regression modeling as a novel approach for augmenting current source tracking efforts.IMPORTANCEThe presence of microbial contaminants, particularly from fecal sources, within water poses a serious risk to public health. The health and economic burden of waterborne pathogens can be substantial-as such, the ability to detect and identify the sources of fecal contamination in environmental waters is crucial for the control of waterborne diseases. This can be accomplished through microbial source tracking, which involves the use of various laboratory techniques to trace the origins of microbial pollution in the environment. Building on current source tracking methodology, we describe a novel workflow that uses logic regression, a supervised machine learning method, to discover genetic markers in Escherichia coli, a common fecal indicator bacterium, that can be used for source tracking efforts. Importantly, our research provides an example of how the rise in prominence of machine learning algorithms can be applied to improve upon current microbial source tracking methodology.
Read full abstract