Evaluation of methods for detecting human reads in microbial sequencing datasets.

Stephen J Bush, A Sarah Walker, Tim E. A. Peto, Derrick W Crook, Thomas R Connor

doi:10.1099/mgen.0.000393

Open Access

https://doi.org/10.1099/mgen.0.000393

Abstract

Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the set of human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. 
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
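The two-stage idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes that "followed by" means reads left unflagged by the first stage are passed to the second, and the two detector functions are hypothetical toy stand-ins for real aligners such as Bowtie2 and SNAP, which would align reads against a human reference rather than search for fixed motifs.

```python
from typing import Callable, Dict, Set

# A detector takes {read_id: sequence} and returns the IDs it calls human.
Detector = Callable[[Dict[str, str]], Set[str]]


def two_stage_human_detection(
    reads: Dict[str, str], first: Detector, second: Detector
) -> Set[str]:
    """Return the IDs flagged as human by either stage of a two-stage cascade."""
    flagged_first = first(reads)
    # Only reads the first stage did not flag are passed on to the second stage.
    remaining = {rid: seq for rid, seq in reads.items() if rid not in flagged_first}
    flagged_second = second(remaining)
    return flagged_first | flagged_second


# Hypothetical toy detectors, used only so the sketch runs without external tools.
def toy_stage_one(reads: Dict[str, str]) -> Set[str]:
    return {rid for rid, seq in reads.items() if "ACGTACGT" in seq}


def toy_stage_two(reads: Dict[str, str]) -> Set[str]:
    return {rid for rid, seq in reads.items() if seq.startswith("TTAGGG")}


reads = {
    "r1": "GGACGTACGTCC",
    "r2": "TTAGGGTTAGGG",
    "r3": "CCCCCCCCCCCC",
}
human_ids = two_stage_human_detection(reads, toy_stage_one, toy_stage_two)
# Reads not flagged by either stage are retained as putatively microbial.
microbial_reads = {rid: seq for rid, seq in reads.items() if rid not in human_ids}
```

Because the second stage only sees reads the first stage missed, it can add to the set of detected human reads but never remove from it.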

Highlights

Sequencing data from host-associated microbes, including metagenomic read sets, can often be contaminated by the body of the investigator or research subject [1].
Comparing methods for detecting human reads in a contaminated microbial read dataset: we evaluated the performance of 20 methods of human read detection, comprising two read classifiers (Centrifuge and Kraken2), both used with two different databases, and eight aligners (Bowtie2 [24], BWA-mem [25], GEM [26], HISAT2 [27], minimap2 [28], Novoalign, SMALT and SNAP [29]), each aligning reads to two different versions of the human primary assembly, GRCh38 and GRCh37.
Each method was evaluated using reads simulated from 10 closed bacterial genomes and three viral genomes [hepatitis C, human immunodeficiency virus (HIV), influenza A] (Table S1, available in the online version of this article), to which were added reads simulated from human genome GRCh38.
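Because the reads are simulated, the true origin of each read is known, so a method's calls can be scored directly. The sketch below shows one way to do that with standard definitions of precision and recall (sensitivity). The function name, the dictionary layout and the toy read IDs are assumptions made for illustration; this excerpt does not state the exact metrics the paper used.

```python
from typing import Dict, Set


def detection_metrics(
    true_origin: Dict[str, str], predicted_human: Set[str]
) -> Dict[str, float]:
    """Score human read calls against the known origin of simulated reads."""
    human = {rid for rid, origin in true_origin.items() if origin == "human"}
    tp = len(human & predicted_human)   # human reads correctly flagged
    fp = len(predicted_human - human)   # microbial reads wrongly flagged as human
    fn = len(human - predicted_human)   # human reads that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy example: four simulated human reads and four simulated bacterial reads.
true_origin = {
    "h1": "human", "h2": "human", "h3": "human", "h4": "human",
    "b1": "bacterial", "b2": "bacterial", "b3": "bacterial", "b4": "bacterial",
}
# A hypothetical method that finds three human reads and misflags one bacterial read.
predicted_human = {"h1", "h2", "h3", "b1"}
metrics = detection_metrics(true_origin, predicted_human)
```

In this framing, a method that misidentifies few bacterial reads as human has high precision, while a method that detects nearly all human reads has high recall.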

Summary

Introduction

Sequencing data from host-associated microbes, including metagenomic read sets, can often be contaminated by the body of the investigator or research subject [1]. With ever-increasing volumes of genomic (and metagenomic) data being deposited in public archives, there is a practical need to benchmark methods of human read detection so as to inform best practice guidelines. These methods follow two basic approaches: subtractive alignment and direct classification. We evaluated several variations on these two basic approaches: by mapping all reads within a mixed dataset to the human genome (using 8 aligners × 2 human genome assemblies), and by predicting read origin with the taxonomic classifiers Centrifuge and Kraken2 [22], using both all-species and human-only databases (2 classifiers × 2 databases). This represents 16 different approaches to subtractive alignment and four different approaches to direct classification, 20 methods in total. We also re-examined 11 577 publicly archived bacterial read sets to identify hitherto undetected human contamination.
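The count of 20 methods follows from crossing each tool with each reference or database. The short sketch below enumerates that grid; the tool and assembly names come from the text, while the label strings and variable names are illustrative assumptions.

```python
from itertools import product

aligners = ["Bowtie2", "BWA-mem", "GEM", "HISAT2", "minimap2", "Novoalign", "SMALT", "SNAP"]
assemblies = ["GRCh38", "GRCh37"]
classifiers = ["Centrifuge", "Kraken2"]
databases = ["all-species", "human-only"]

# Subtractive alignment: every aligner paired with every human assembly (8 x 2).
alignment_methods = [f"{a} vs {g}" for a, g in product(aligners, assemblies)]
# Direct classification: every classifier paired with every database (2 x 2).
classification_methods = [f"{c} with {d} database" for c, d in product(classifiers, databases)]

all_methods = alignment_methods + classification_methods
print(len(alignment_methods), len(classification_methods), len(all_methods))
```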

Journal: Microbial Genomics	Publication Date: Jun 19, 2020
Citations: 20	License type: CC BY 4.0
