Abstract
Sensitive removal of host genetic information (i.e., host depletion) is a key first step towards accurate classification of potential microbial DNA sequences in low-biomass samples, such as human tumors. Inadequate filtration of human reads prior to microbial classification can cause false positives, biasing biological conclusions. Previous computational host depletion pipelines predominantly employed the human reference genome GRCh38 (hg38), which lacks the genetic diversity of the human population and excludes substantial portions of the Y chromosome. To address these deficiencies, we constructed a pipeline that maximally removes human reads by incorporating hg38, the first complete genome build T2T-CHM13v2.0 (T2T), and the 47 diploid genomes from the Human Pangenome Reference Consortium (HPRC) [1]. Reads passing maximal host depletion and quality control are considered sufficiently cleaned for downstream microbial analyses. We first evaluated the specificity of our pipeline by simulating Illumina whole genome sequencing (WGS) data from 10 held-out HPRC individuals, whose genomes were excluded from the host depletion pipeline. We found a significant reduction in the median percentage of human reads remaining across successive host depletion methods (hg38, 0.05369%; hg38+T2T, 0.00535%; hg38+T2T+HPRC, 0.00093%). For sensitivity, we applied our pipeline to simulated data from 962 microbes in the FDA-ARGOS dataset and found minimal removal of true microbial reads (median 99.9390% of reads retained). As a real-world example, we evaluated our pipeline on 583 metastatic colorectal cancer tissue WGS samples from the Hartwig Medical Foundation (HMF) database (median 106X coverage) [2] and found successive decreases in the median number of reads retained (hg38, 185,135; hg38+T2T, 92,951; hg38+T2T+HPRC, 10,314). Using Woltka for taxonomic classification against the RefSeq210 database, we assessed similarity between biological sexes across the HMF samples using Robust Aitchison PCA (RPCA).
Notably, microbial data following hg38-only host depletion showed significant microbial differences between male- and female-labeled samples (p=0.00001), which was biologically unexpected. However, our updated pipeline (hg38+T2T+HPRC) removed the false sex-based separation in the microbial data (p=0.1837). Thus, using diverse sets of complete human genomes for computational host depletion can mitigate artifactual bias, enabling more sensitive and specific host-microbe conclusions in cancer research.

[1] Liao et al. Nature 617, 312-324 (2023).
[2] Priestley et al. Nature 575, 210-216 (2019).

Citation Format: Caitlin Guccione, Lucas Patel, Daniel McDonald, Cameron Martino, Antonio González, Gregory D. Sepich-Poore, Rob Knight, Kit Curtius. Improving microbial detection in cancer tissue samples with computational host depletion using the pangenome [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4954.
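The successive reductions reported in the abstract can be restated as fold changes relative to hg38-only host depletion. The following is a minimal arithmetic sketch, not part of the authors' pipeline: the dictionary names and the `fold_reduction` helper are hypothetical, and the only inputs are the median values quoted above.

```python
# Median values as reported in the abstract, keyed by host depletion reference set.
# Percentage of simulated human reads remaining (10 held-out HPRC genomes).
simulated_human_pct_remaining = {
    "hg38": 0.05369,
    "hg38+T2T": 0.00535,
    "hg38+T2T+HPRC": 0.00093,
}

# Number of reads retained in the 583 HMF colorectal cancer samples.
hmf_reads_retained = {
    "hg38": 185_135,
    "hg38+T2T": 92_951,
    "hg38+T2T+HPRC": 10_314,
}


def fold_reduction(values, baseline="hg38"):
    """Hypothetical helper: ratio of the baseline median to each method's median."""
    base = values[baseline]
    return {method: base / value for method, value in values.items()}


sim_fold = fold_reduction(simulated_human_pct_remaining)
hmf_fold = fold_reduction(hmf_reads_retained)

for method in simulated_human_pct_remaining:
    print(
        f"{method}: simulated {sim_fold[method]:.1f}x, "
        f"HMF {hmf_fold[method]:.1f}x fewer reads than hg38 alone"
    )
```

These are ratios of the reported medians, not medians of per-sample ratios, so they summarize the trend only approximately.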