Shotgun metagenomics of soil invertebrate communities reflects taxonomy, biomass, and reference genome properties.

Abstract

Metagenomics – shotgun sequencing of all DNA fragments from a community DNA extract – is routinely used to describe the composition, structure, and function of microorganism communities. Advances in DNA sequencing and the availability of genome databases increasingly allow the use of shotgun metagenomics on eukaryotic communities. Metagenomics offers major advances in the recovery of biomass relationships in a sample, in comparison to taxonomic marker gene‐based approaches (metabarcoding). However, little is known about the factors which influence metagenomics data from eukaryotic communities, such as differences among organism groups, the properties of reference genomes, and genome assemblies.

We evaluated how shotgun metagenomics records composition and biomass in artificial soil invertebrate communities at different sequencing efforts. We generated mock communities of controlled biomass ratios from 28 species from all major soil mesofauna groups: mites, springtails, nematodes, tardigrades, and potworms. We shotgun sequenced these communities and taxonomically assigned them with a database of over 270 soil invertebrate genomes.

We recovered over 95% of the species, and observed relatively high false‐positive detection rates. We found strong differences in reads assigned to different taxa, with some groups (e.g., springtails) consistently attracting more hits than others (e.g., enchytraeids). Original biomass could be predicted from read counts after considering these taxon‐specific differences. Species with larger genomes, and with more complete assemblies, consistently attracted more reads than species with smaller genomes. The GC content of the genome assemblies had no effect on the biomass–read relationships. Results were similar among different sequencing efforts.

The results show considerable differences in taxon recovery and taxon specificity of biomass recovery from metagenomic sequence data. The properties of reference genomes and genome assemblies also influence biomass recovery, and they should be considered in metagenomic studies of eukaryotes. We show that low‐ and high‐sequencing efforts yield similar results, suggesting high cost‐efficiency of metagenomics for eukaryotic communities. We provide a brief roadmap for investigating factors which influence metagenomics‐based eukaryotic community reconstructions. Understanding these factors is timely as accessibility of DNA sequencing and momentum for reference genomes projects show a future where the taxonomic assignment of DNA from any community sample becomes a reality.
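
The abstract states that original biomass could be predicted from read counts once taxon-specific differences were taken into account. The sketch below illustrates one simple way such a correction could work; it is not the authors' method. It assumes that each taxon attracts reads in proportion to its biomass share multiplied by a constant taxon-specific factor, estimates that factor from mock communities of known composition, and then divides it out of a new sample. The function names and all numbers are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def fit_correction_factors(calibration):
    """Estimate one read-attraction factor per taxon.

    `calibration` is an iterable of (taxon, biomass_share, read_share)
    tuples from mock communities of known composition (hypothetical data).
    The factor is the geometric mean of read_share / biomass_share.
    """
    log_ratios = defaultdict(list)
    for taxon, biomass_share, read_share in calibration:
        log_ratios[taxon].append(math.log(read_share / biomass_share))
    return {t: math.exp(sum(v) / len(v)) for t, v in log_ratios.items()}


def predict_biomass_shares(read_counts, factors):
    """Divide each taxon's reads by its factor, then renormalize to sum to 1."""
    corrected = {t: n / factors[t] for t, n in read_counts.items()}
    total = sum(corrected.values())
    return {t: x / total for t, x in corrected.items()}


# Toy calibration set (invented numbers): springtails attract twice as many
# reads as their biomass share suggests, enchytraeids half as many.
calibration = [
    ("springtails", 0.2, 0.4),
    ("springtails", 0.1, 0.2),
    ("enchytraeids", 0.4, 0.2),
    ("enchytraeids", 0.2, 0.1),
]
factors = fit_correction_factors(calibration)

# Raw read counts from a new, hypothetical sample.
new_sample = {"springtails": 800, "enchytraeids": 200}
predicted = predict_biomass_shares(new_sample, factors)
print(factors)
print(predicted)
```

In this toy case the raw reads suggest a community dominated by springtails, while the corrected shares are equal. A real analysis would more plausibly fit a regression with taxon as a covariate and could add genome size and assembly completeness as predictors, since the abstract reports that both influence read recovery.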

Citations (showing 10 of 13 papers)
  • Research Article
  • 10.1002/fsh3.70023
An Integrated Strategy Combining Shotgun Metabarcoding With Conventional Methods for Comprehensive Authentication of Food‐Medicine Homologous Species and Fungal Contamination Detection in Yangyin Qingfei Wan
  • May 29, 2025
  • Food Safety and Health
  • Yu Tian + 9 more

OTC medicines frequently incorporate species with “food‐medicine homology”, navigating regulatory gray areas, particularly concerning fungal pathogens and adulterated bioactive components that pose health risks to consumers. This research employed a hybrid approach combining shotgun metabarcoding with conventional methods (microscopic identification, thin‐layer chromatography [TLC] and high‐performance liquid chromatography [HPLC]) to authenticate ingredients and assess fungal contamination in Yangyin Qingfei Wan (YYQFW). Analysis encompassed two mock samples and 19 commercial YYQFW samples. Conventional methods confirmed adherence to Chinese Pharmacopeia standards: microscopic analysis identified essential tissue structures, TLC detected key compounds (paeonol, paeoniflorin), and HPLC quantified paeonol content (6.2–9.1 mg/pill), surpassing the standard (5.8 mg/pill). Shotgun metabarcoding retrieved sequences from four DNA barcoding regions (ITS2, psbA‐trnH, matK, and rbcL), yielding 350, 58, 49, and 53 operational taxonomic units (OTUs), respectively. All labeled ingredients were authenticated via DNA barcodes matching reference databases. Importantly, ITS2 analysis identified 53 fungal OTUs (17 genera, predominantly Aspergillus, Fusarium, and Cladosporium), indicating potential mycotoxin presence. This research underscores the efficacy of integrating methods to ensure the reliability of OTC medications containing food‐medicine homologous ingredients, which is crucial given their widespread clinical use. Furthermore, ITS2 emerges as the optimal barcode sequence in shotgun metabarcoding studies, verifying ingredient authenticity and detecting fungal contaminants effectively.

  • Research Article
  • Cited by 10
  • 10.3897/mbmg.7.112290
Abundance estimation with DNA metabarcoding – recent advancements for terrestrial arthropods
  • Nov 23, 2023
  • Metabarcoding and Metagenomics
  • Wiebke Sickel + 4 more

Biodiversity is declining at alarming rates worldwide and large-scale monitoring is urgently needed to understand changes and their drivers. While classical taxonomic identification of species is time and labour intensive, the combination with DNA-based methods could upscale monitoring activities to achieve larger spatial coverage and increased sampling effort. However, challenges remain for DNA-based methods when the number of individuals per species and/or biomass estimates are required. Several methodological advancements exist to improve the potential of DNA metabarcoding for abundance analysis, which however need further evaluation. Here, we discuss laboratory, as well as some bioinformatic adjustments to DNA metabarcoding workflows regarding their potential to achieve species abundance estimation from arthropod community samples. Our review includes pre-laboratory processing methods such as specimen photography, laboratory methods such as the use of spike-in DNA as an internal standard and bioinformatic advancements like correction factors. We conclude that specimen photography coupled with DNA metabarcoding currently promises the greatest potential to achieve estimates of the number of individuals per species and biomass estimates, but that approaches such as spike-ins and correction factors are promising methods to pursue further.

  • Research Article
  • 10.1038/s41598-025-18936-5
A novel two-step metabarcoding approach improves soil microbiome biodiversity assessment.
  • Sep 29, 2025
  • Scientific Reports
  • Marcin Musiałowski + 3 more

The foundation of microbial ecology research is Next-Generation Sequencing (NGS), which allows for reconstruction of the soil microbiome taxonomical structure and the calculation of biodiversity metrics. However, obtaining reliable data on soil biodiversity poses several challenges, with accurate primer selection being one of the most critical. While 16S rDNA primers are widely used for their ability to broadly target bacterial communities, they can introduce biases. These primers may preferentially amplify certain bacterial groups, leading to a skewed representation of the microbial diversity in soil samples. To overcome the bias, we developed a novel, Two-Step Metabarcoding (TSM) approach to obtain more accurate and detailed data on soil microbiome structure and biodiversity. The first step involved sequencing of amplicons generated using universal 16S rDNA primers, provided an initial overview of the microbial community, and allowed the identification of key taxonomical groups. In the second step, we employed sequencing of amplicons generated with taxa-specific primers designed for the most abundant phyla in the community. We used the obtained data for a more reliable reconstruction of microbiome taxonomic structure and biodiversity. This two-step approach ensures a thorough exploration of the soil microbiome and promises to enhance our understanding of soil microbial dynamics and ecology.

  • Research Article
  • 10.1016/j.micres.2025.128073
Rootstocks and drought stress impact the composition and functionality of grapevine rhizosphere bacterial microbiota.
  • Apr 1, 2025
  • Microbiological Research
  • David Labarga + 5 more

  • Research Article
  • Cited by 7
  • 10.1111/icad.12726
eDNA for monitoring and conserving terrestrial arthropods: Insights from a systematic map and barcode repositories assessments
  • Mar 8, 2024
  • Insect Conservation and Diversity
  • Camila Leandro + 2 more

In the past decade, environmental DNA (eDNA) assays have become an essential tool to investigate species presence with samples from the environment instead of collected specimens. eDNA sampling techniques have proved their worth in freshwater and marine studies; now, some trends emerge for their use in terrestrial habitats and particularly to study arthropods. After a systematic review of the literature, we illustrate and analyse the diversity of such studies and discuss their benefits and drawbacks. We identified the most relevant research themes and focused on (i) the taxa and environmental sample types targeted and (ii) the details of the survey scheme. In parallel, we also assessed the available number of sequences from cytochrome c oxidase subunit I (COI), 16S and 18S barcode regions for four major taxa (spiders, centipedes, springtails and insects) in relation to their diversity. We found strong taxonomic and geographic biases regarding coverage per barcode. eDNA research on terrestrial arthropods mainly focuses on insect species that affect humanity in a positive or negative way, and the availability of sequences is much higher for species from temperate‐developed countries than from tropical ones. Moreover, although a high variety of environmental samples are being used, most studies do not assess the barcode completeness of the target taxa nor compare the efficacy of eDNA monitoring technique to other well established and known traditional techniques. Careful workflow designs and comparisons are needed before giving any management or conservation advice as eDNA monitoring does not come without error. Strengths and weaknesses of eDNA assays for conservation are discussed.

  • Preprint Article
  • Cited by 6
  • 10.1101/2023.01.23.525240
The sound of restored soil: Measuring soil biodiversity in a forest restoration chronosequence with ecoacoustics
  • Jan 23, 2023
  • Jake M Robinson + 2 more

Forest restoration requires monitoring to assess changes in above- and below-ground communities, which is challenging due to practical and resource limitations. With emerging sound recording technologies, ecological acoustic survey methods—also known as ‘ecoacoustics’—are increasingly available. These provide a rapid, effective, and non-intrusive means of monitoring biodiversity. Above-ground ecoacoustics is increasingly widespread, but soil ecoacoustics has yet to be utilised in restoration despite its demonstrable effectiveness at detecting meso- and macrofauna acoustic signals. This study applied ecoacoustic tools and indices (Acoustic Complexity Index, Normalised Difference Soundscape Index, and Bioacoustic Index) to measure above- and below-ground biodiversity in a forest restoration chronosequence. We hypothesised that higher acoustic complexity, diversity and high-frequency to low-frequency ratio would be detected in restored forest plots. We collected n = 198 below-ground samples and n = 180 ambient and controlled samples from three recently degraded (within 10 years) and three restored (30-51 years ago) deciduous forest plots across three monthly visits. We used passive acoustic monitoring to record above-ground biological sounds and a below-ground sampling device and sound-attenuation chamber to record soil communities. We found that restored plot acoustic complexity and diversity were higher in the sound-attenuation chamber soil but not in situ or above-ground samples. Moreover, we found that restored plots had a significantly greater high-frequency to low-frequency ratio for soil, but no such association for above-ground samples. Our results suggest that ecoacoustics has the potential to monitor below-ground biodiversity, adding to the restoration ecologist’s toolkit and supporting global ecosystem recovery. Implications for Practice: This is the first known study to assess the sounds of soil biodiversity in a forest restoration context, paving the way for more comprehensive studies and practical applications to support global ecosystem recovery. Soil ecoacoustics has the potential to support restoration ecology/biodiversity assessments, providing a minimally intrusive, cost-effective and rapid surveying tool. The methods are also relatively simple to learn and apply. Ecoacoustics can contribute toward overcoming the profound challenge of quantifying the effectiveness (i.e., success) of forest restoration interventions in reinstating target species, functions and so-called ‘services’ and reducing disturbance.

  • Book Chapter
  • Cited by 5
  • 10.1016/bs.aecr.2023.09.002
A roadmap for biomonitoring in the 21st century: Merging methods into metrics via ecological networks
  • Jan 1, 2023
  • Jordan P Cuff + 8 more

  • Research Article
  • Cited by 10
  • 10.1038/s42003-023-05621-4
The MetaInvert soil invertebrate genome resource provides insights into below-ground biodiversity and evolution
  • Dec 8, 2023
  • Communications Biology
  • Gemma Collins + 20 more

Soil invertebrates are among the least understood metazoans on Earth. Thus far, the lack of taxonomically broad and dense genomic resources has made it hard to thoroughly investigate their evolution and ecology. With MetaInvert we provide draft genome assemblies for 232 soil invertebrate species, representing 14 common groups and 94 families. We show that this data substantially extends the taxonomic scope of DNA- or RNA-based taxonomic identification. Moreover, we confirm that theories of genome evolution cannot be generalised across evolutionarily distinct invertebrate groups. The soil invertebrate genomes presented here will support the management of soil biodiversity through molecular monitoring of community composition and function, and the discovery of evolutionary adaptations to the challenges of soil conditions.

  • Research Article
  • Cited by 6
  • 10.1111/afe.12628
Metabarcoding advances agricultural invertebrate biomonitoring by enhancing resolution, increasing throughput and facilitating network inference
  • May 8, 2024
  • Agricultural and Forest Entomology
  • Ben S J Hawthorne + 3 more

Biomonitoring of agriculturally important insects is increasingly vital given our need to understand: (a) the severity of impacts by pests and pathogens on crop yield and health and (b) the impact of environmental change and land management on insects, in line with sustainable development and global conservation targets. Traditional entomological traps remain an important part of the biomonitoring toolbox, but sample processing is laborious and introduces latency, and accuracy can be variable. The integration of molecular techniques such as environmental DNA and DNA metabarcoding into insect biomonitoring has gained increasing attention, but the advantages of doing so, the kind of data this can generate, and how easily and effectively molecular analyses can be integrated with the diverse types of entomological traps currently used remains relatively unclear. In this review, we examine how combining DNA metabarcoding with a range of conventional and unconventional entomological sampling techniques can advance biomonitoring in a way that is useful to researchers and practitioners. We highlight some of the key challenges and how to mitigate them, using examples of its integration with different sampling methods from the literature (e.g., interception, pitfall and sticky traps) to demonstrate efficacy and suitability. We discuss how metabarcoding data can be used to infer ecological networks, emphasizing the importance of this as a framework for understanding species interactions and ecosystem functioning for more effective and descriptive biomonitoring. Finally, future advances in biomonitoring are highlighted, alongside recommendations of best practice for researchers both new to and experienced in invertebrate biomonitoring with metabarcoding.

  • Preprint Article
  • 10.1101/2025.07.21.665925
Two decades of compositional restructuring of soil biodiversity in Germany despite stable α- and β-diversity indices
  • Jul 24, 2025
  • Judith Paetsch + 8 more

Soil ecosystems host some of the most taxonomically and functionally diverse biological communities on Earth, yet long-term trends in their biodiversity remain poorly understood. Here, we analysed soil biodiversity dynamics over 20 years with samples archived in the German Environmental Specimen Bank. We assessed temporal and spatial patterns in α-diversity and β-diversity with shotgun metagenomics across bacteria, fungi, and metazoa. We found no statistically significant temporal trends in α-diversity for any group. Total β-diversity also appeared temporally stable. However, decomposing β-diversity into its balanced variation and abundance gradients revealed taxon-specific compositional restructuring. Bacterial and fungal communities showed signs of compositional homogenisation, while metazoan communities remained more stable. Spatial structuring was pronounced across all groups. Land use emerged as a key spatial predictor of community composition for bacteria and fungi, and geographic locality for metazoans. Our findings show that apparent stability in standard biodiversity indices may mask significant underlying community change. This highlights the need for integrative, taxonomically inclusive approaches to biodiversity monitoring. The combination of environmental specimen banking with metagenomic sequencing offers a powerful framework for uncovering hidden biodiversity trends in soil ecosystems and identifying the drivers of ecological reorganisation under global change.

Similar Papers
  • Research Article
  • Cited by 8
  • 10.1007/s12561-016-9148-x
A Model-Based Approach For Species Abundance Quantification Based On Shotgun Metagenomic Data.
  • Jun 1, 2017
  • Statistics in Biosciences
  • Eric Z Chen + 2 more

The human microbiome, which includes the collective microbes residing in or on the human body, has a profound influence on human health. DNA sequencing technology has made large-scale human microbiome studies possible by using shotgun metagenomic sequencing. One important aspect of data analysis of such metagenomic data is to quantify the bacterial abundances based on the metagenomic sequencing data. Existing methods almost always quantify such abundances one sample at a time, which ignores certain systematic differences in read coverage along the genomes due to GC content, copy number variation and the bacterial origin of replication. In order to account for such differences in read counts, we propose a multi-sample Poisson model to quantify microbial abundances based on read counts that are assigned to species-specific taxonomic markers. Our model takes into account the marker-specific effects when normalizing the sequencing count data in order to obtain more accurate quantification of the species abundances. Compared to currently available methods on simulated data and real data sets, our method has demonstrated an improved accuracy in bacterial abundance quantification, which leads to more biologically interesting results from downstream data analysis.

  • Research Article
  • Cited by 66
  • 10.1038/ismej.2013.21
Waiting for the human intestinal Eukaryotome
  • Feb 14, 2013
  • The ISME Journal
  • Lee O’Brien Andersen + 2 more

  • Research Article
  • 10.1128/msystems.00413-25
Tracing non-fungal eukaryotic diversity via shotgun metagenomes in the complex mudflat intertidal zones.
  • Jun 12, 2025
  • mSystems
  • He Han + 11 more

Eukaryotes, both micro- and macro-, constitute the dominant component of Earth's biosphere visible to the naked eye. Although relatively big in organismal size, tracing eukaryotic diversity in complex environments is not easy. For example, they may actively escape from sampling and be physically absent from the collected samples. In this study, we strived to recover non-fungal eukaryotic DNA sequences from typical shotgun metagenomes in the complex mudflat intertidal zones. Multiple recently developed approaches for identifying eukaryotic sequences from shotgun metagenomes were comparatively assessed. Considering the low overlap among different approaches, an integrative workflow was proposed. The integrative workflow was then used to recover the eukaryotic communities in complex intertidal sediments. The temporal dynamics of intertidal eukaryotic communities were investigated through a time-series sampling effort. Thirty-four non-fungal eukaryotic phyla were detected from 36 shotgun metagenomes. Clear temporal variation in relative abundance was observed for eukaryotic genera such as Timema and Navicula. Strong temporal turnover of intertidal eukaryotic communities was observed. By comparing to 18S rRNA gene amplicon sequencing, dramatically different community profiles were observed between these two approaches. However, the temporal patterns for intertidal eukaryotic communities recovered by both approaches were generally comparable. This study provides valuable technical insights into the recovery of non-fungal eukaryotic information from complex environments and demonstrates an alternative route for reusing the massive metagenomic data sets generated in the past and future. IMPORTANCE: Eukaryotes represent the dominant component visible to the naked eye and contribute to the primary biomass in the Earth's biosphere. Yet, tracing the eukaryotic diversity in complex environments remains difficult, as they can actively move around and escape from sampling. Here, using the intertidal sediments as an example, we strived to retrieve non-fungal eukaryotic sequences from typical shotgun metagenomes. Compared to 18S rRNA gene amplicon sequencing, the shotgun metagenome-based approach resolved dramatically different eukaryotic community profiles, though comparable ecological patterns could be observed. This study paves an alternative way for utilizing shotgun metagenomic data to recover non-fungal eukaryotic information in complex environments, demonstrating significant potential for environmental monitoring and biodiversity investigations.

  • Research Article
  • Cited by 69
  • 10.1186/s40168-019-0657-y
Mining, analyzing, and integrating viral signals from metagenomic data
  • Mar 19, 2019
  • Microbiome
  • Tingting Zheng + 10 more

Background: Viruses are important components of microbial communities modulating community structure and function; however, only a couple of tools are currently available for phage identification and analysis from metagenomic sequencing data. Here we employed the random forest algorithm to develop VirMiner, a web-based phage contig prediction tool especially sensitive for high-abundance phage contigs, trained and validated by paired metagenomic and phagenomic sequencing data from the human gut flora. Results: VirMiner achieved 41.06% ± 17.51% sensitivity and 81.91% ± 4.04% specificity in the prediction of phage contigs. In particular, for the high-abundance phage contigs, VirMiner outperformed other tools (VirFinder and VirSorter) with much higher sensitivity (65.23% ± 16.94%) than VirFinder (34.63% ± 17.96%) and VirSorter (18.75% ± 15.23%) at almost the same specificity. Moreover, VirMiner provides the most comprehensive phage analysis pipeline which is comprised of metagenomic raw reads processing, functional annotation, phage contig identification, and phage-host relationship prediction (CRISPR-spacer recognition) and supports two-group comparison when the input (metagenomic sequence data) includes different conditions (e.g., case and control). Application of VirMiner to an independent cohort of human gut metagenomes obtained from individuals treated with antibiotics revealed that 122 KEGG orthology and 118 Pfam groups had significantly differential abundance in the pre-treatment samples compared to samples at the end of antibiotic administration, including clustered regularly interspaced short palindromic repeats (CRISPR), multidrug resistance, and protein transport. The VirMiner webserver is available at http://sbb.hku.hk/VirMiner/. Conclusions: We developed a comprehensive tool for phage prediction and analysis for metagenomic samples. Compared to VirSorter and VirFinder, the most widely used tools, VirMiner is able to capture more high-abundance phage contigs which could play key roles in infecting bacteria and modulating microbial community dynamics. Trial registration: The European Union Clinical Trials Register, EudraCT Number: 2013-003378-28. Registered on 9 April 2014

  • Research Article
  • Cited by 7
  • 10.1128/msystems.00925-22
TaxiBGC: a Taxonomy-Guided Approach for Profiling Experimentally Characterized Microbial Biosynthetic Gene Clusters and Secondary Metabolite Production Potential in Metagenomes
  • Nov 15, 2022
  • mSystems
  • Vinod K Gupta + 8 more

Biosynthetic gene clusters (BGCs) in microbial genomes encode bioactive secondary metabolites (SMs), which can play important roles in microbe-microbe and host-microbe interactions. Given the biological significance of SMs and the current profound interest in the metabolic functions of microbiomes, the unbiased identification of BGCs from high-throughput metagenomic data could offer novel insights into the complex chemical ecology of microbial communities. Currently available tools for predicting BGCs from shotgun metagenomes have several limitations, including the need for computationally demanding read assembly, predicting a narrow breadth of BGC classes, and not providing the SM product. To overcome these limitations, we developed taxonomy-guided identification of biosynthetic gene clusters (TaxiBGC), a command-line tool for predicting experimentally characterized BGCs (and inferring their known SMs) in metagenomes by first pinpointing the microbial species likely to harbor them. We benchmarked TaxiBGC on various simulated metagenomes, showing that our taxonomy-guided approach could predict BGCs with much-improved performance (mean F1 score, 0.56; mean PPV score, 0.80) compared with directly identifying BGCs by mapping sequencing reads onto the BGC genes (mean F1 score, 0.49; mean PPV score, 0.41). Next, by applying TaxiBGC on 2,650 metagenomes from the Human Microbiome Project and various case-control gut microbiome studies, we were able to associate BGCs (and their SMs) with different human body sites and with multiple diseases, including Crohn’s disease and liver cirrhosis. In all, TaxiBGC provides an in silico platform to predict experimentally characterized BGCs and their SM production potential in metagenomic data while demonstrating important advantages over existing techniques. IMPORTANCE: Currently available bioinformatics tools to identify BGCs from metagenomic sequencing data are limited in their predictive capability or ease of use to even computationally oriented researchers. We present an automated computational pipeline called TaxiBGC, which predicts experimentally characterized BGCs (and infers their known SMs) in shotgun metagenomes by first considering the microbial species source. Through rigorous benchmarking techniques on simulated metagenomes, we show that TaxiBGC provides a significant advantage over existing methods. When demonstrating TaxiBGC on thousands of human microbiome samples, we associate BGCs encoding bacteriocins with different human body sites and diseases, thereby elucidating a possible novel role of this antibiotic class in maintaining the stability of microbial ecosystems throughout the human body. Furthermore, we report for the first time gut microbial BGC associations shared among multiple pathologies. Ultimately, we expect our tool to facilitate future investigations into the chemical ecology of microbial communities across diverse niches and pathologies.

  • Research Article
  • Cited by 63
  • 10.1371/journal.pone.0167870
Strain-Level Discrimination of Shiga Toxin-Producing Escherichia coli in Spinach Using Metagenomic Sequencing.
  • Dec 8, 2016
  • PLOS ONE
  • Susan R Leonard + 3 more

Consumption of fresh bagged spinach contaminated with Shiga toxin-producing Escherichia coli (STEC) has led to severe illness and death; however current culture-based methods to detect foodborne STEC are time consuming. Since not all STEC strains are considered pathogenic to humans, it is crucial to incorporate virulence characterization of STEC in the detection method. In this study, we assess the comprehensiveness of utilizing a shotgun metagenomics approach for detection and strain-level identification by spiking spinach with a variety of genomically disparate STEC strains at a low contamination level of 0.1 CFU/g. Molecular serotyping, virulence gene characterization, microbial community analysis, and E. coli core gene single nucleotide polymorphism (SNP) analysis were performed on metagenomic sequence data from enriched samples. It was determined from bacterial community analysis that E. coli, which was classified at the phylogroup level, was a major component of the population in most samples. However, in over half the samples, molecular serotyping revealed the presence of indigenous E. coli which also contributed to the percent abundance of E. coli. Despite the presence of additional E. coli strains, the serotype and virulence genes of the spiked STEC, including correct Shiga toxin subtype, were detected in 94% of the samples with a total number of reads per sample averaging 2.4 million. Variation in STEC abundance and/or detection was observed in replicate spiked samples, indicating an effect from the indigenous microbiota during enrichment. SNP analysis of the metagenomic data correctly placed the spiked STEC in a phylogeny of related strains in cases where the indigenous E. coli did not predominate in the enriched sample. Also, for these samples, our analysis demonstrates that strain-level phylogenetic resolution is possible using shotgun metagenomic data for determining the genomic relatedness of a contaminating STEC strain to other closely related E. coli.

  • Research Article
  • Cite Count Icon 1757
  • 10.1186/s40168-018-0541-1
MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis
  • Sep 15, 2018
  • Microbiome
  • Gherman V Uritskiy + 2 more

Background: The study of microbiomes using whole-metagenome shotgun sequencing enables the analysis of uncultivated microbial populations that may have important roles in their environments. Extracting individual draft genomes (bins) facilitates metagenomic analysis at the single genome level. Software and pipelines for such analysis have become diverse and sophisticated, resulting in a significant burden for biologists to access and use them. Furthermore, while bin extraction algorithms are rapidly improving, there is still a lack of tools for their evaluation and visualization. Results: To address these challenges, we present metaWRAP, a modular pipeline software for shotgun metagenomic data analysis. MetaWRAP deploys state-of-the-art software to handle metagenomic data processing starting from raw sequencing reads and ending in metagenomic bins and their analysis. MetaWRAP is flexible enough to give investigators control over the analysis, while still being easy-to-install and easy-to-use. It includes hybrid algorithms that leverage the strengths of a variety of software to extract and refine high-quality bins from metagenomic data through bin consolidation and reassembly. MetaWRAP’s hybrid bin extraction algorithm outperforms individual binning approaches and other bin consolidation programs in both synthetic and real data sets. Finally, metaWRAP comes with numerous modules for the analysis of metagenomic bins, including taxonomy assignment, abundance estimation, functional annotation, and visualization. Conclusions: MetaWRAP is an easy-to-use modular pipeline that automates the core tasks in metagenomic analysis, while contributing significant improvements to the extraction and interpretation of high-quality metagenomic bins. The bin refinement and reassembly modules of metaWRAP consistently outperform other binning approaches. Each module of metaWRAP is also a standalone component, making it a flexible and versatile tool for tackling metagenomic shotgun sequencing data. MetaWRAP is open-source software available at https://github.com/bxlab/metaWRAP.

  • Research Article
  • Cite Count Icon 3
  • 10.1186/s12864-019-5467-x
Estimating the total genome length of a metagenomic sample using k-mers
  • Apr 1, 2019
  • BMC Genomics
  • Kui Hua + 1 more

Background: Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on humans and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of microbiome composition, limited availability of known reference genomes, and usually insufficient sequencing coverage. Results: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses. Conclusions: We proposed the question of how long the total genome length of all different species in a microbial community is and introduced a method to answer it.
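The quantity this abstract builds on, the number of distinct k-mers observed in a read set, can be illustrated with a minimal Python sketch. This is not the authors' estimator: it only counts the k-mers actually seen, so at low coverage it undercounts the true total, which is the gap the coverage-distribution method and the KRI are described as addressing. The function name `distinct_kmers` and the toy reads are hypothetical, and reverse complements are deliberately not merged to keep the example short.

```python
from typing import Iterable, Set


def distinct_kmers(reads: Iterable[str], k: int) -> Set[str]:
    """Collect the distinct k-mers observed across all reads.

    Simplifying assumptions (not from the paper): k-mers containing
    the ambiguous base N are skipped, and a k-mer and its reverse
    complement are treated as different strings.
    """
    seen: Set[str] = set()
    for read in reads:
        read = read.upper()
        # A read shorter than k yields an empty range and adds nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                seen.add(kmer)
    return seen


# Toy reads: the first two overlap, so they share k-mers.
toy_reads = ["ACGTAC", "CGTACG", "TTTTTT"]
observed = distinct_kmers(toy_reads, k=4)
print(len(observed), sorted(observed))
```

For real data sets an in-memory set of strings would be far too large; dedicated k-mer counters or sketching data structures are the usual choice, and the count would still need a correction for unobserved k-mers.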

  • Research Article
  • Cite Count Icon 5
  • 10.3390/ijms18102124
A Massively Parallel Sequence Similarity Search for Metagenomic Sequencing Data
  • Oct 11, 2017
  • International Journal of Molecular Sciences
  • Masanori Kakuta + 4 more

Sequence similarity searches have been widely used in the analyses of metagenomic sequencing data. Finding homologous sequences in a reference database enables the estimation of taxonomic and functional characteristics of each query sequence. Because current metagenomic sequencing data consist of a large number of nucleotide sequences, the time required for sequence similarity searches accounts for a large proportion of the total analysis time. This time-consuming step makes it difficult to perform large-scale analyses. To analyze large-scale metagenomic data, such as those found in the human oral microbiome, we developed GHOST-MP (Genome-wide HOmology Search Tool on Massively Parallel system), a parallel sequence similarity search tool for massively parallel computing systems. This tool uses a fast search algorithm based on suffix arrays of query and database sequences and a hierarchical parallel search to accelerate the large-scale sequence similarity search of metagenomic sequencing data. The parallel computing efficiency and the search speed of this tool were evaluated. GHOST-MP was shown to be scalable over 10,000 CPU (Central Processing Unit) cores, and achieved over 80-fold acceleration compared with mpiBLAST using the same computational resources. We applied this tool to human oral metagenomic data, and the results indicate that the oral cavity, the oral vestibule, and plaque have different characteristics based on the functional gene category.

  • Research Article
  • Cite Count Icon 94
  • 10.1186/s40793-019-0347-1
The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples
  • Oct 24, 2019
  • Environmental Microbiome
  • H Soon Gweon + 16 more

Background: Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced samples with shotgun metagenomics at high depth (~200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short- and long-reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, ‘ResPipe’. Results: Taxonomic profiling was much more stable to sequencing depth than AMR gene content. 1 million reads per sample was sufficient to achieve <1% dissimilarity to the full taxonomic composition. However, at least 80 million reads per sample were required to recover the full richness of different AMR gene families present in the sample, and additional allelic diversity of AMR genes was still being discovered in effluent at 200 million reads per sample. Normalising the number of reads mapping to AMR genes using gene length and an exogenous spike of Thermus thermophilus DNA substantially changed the estimated gene abundance distributions. While the majority of genomic content from cultured isolates from effluent was recoverable using shotgun metagenomics, this was not the case for pig caeca or river sediment. Conclusions: Sequencing depth and profiling method can critically affect the profiling of polymicrobial animal and environmental samples with shotgun metagenomics. Both sequencing of cultured isolates and shotgun metagenomics can recover substantial diversity that is not identified using the other methods. Particular consideration is required when inferring AMR gene content or presence by mapping metagenomic reads to a database. ResPipe, the open-source software pipeline we have developed, is freely available (https://gitlab.com/hsgweon/ResPipe).

  • Research Article
  • Cite Count Icon 7
  • 10.1016/j.dib.2020.106226
Shotgun metagenomic data of microbiomes on plastic fabrics exposed to harsh tropical environments.
  • Aug 24, 2020
  • Data in brief
  • Osman Radwan + 1 more

The development of more affordable high-throughput DNA sequencing technologies and powerful bioinformatics is making shotgun metagenomics a common tool for effective characterization of microbiomes and robust functional genomics. A shotgun metagenomic approach was applied in the characterization of microbial communities associated with plasticized fabric materials exposed to a harsh tropical environment for 14 months. High-throughput sequencing of TruSeq paired-end libraries was conducted using a whole-genome shotgun (WGS) approach on an Illumina HiSeq2000 platform generating 100 bp reads. A multifaceted bioinformatics pipeline was developed and applied to conduct quality control and trimming of raw reads, microbial classification, assembly of multi-microbial genomes, binning of assembled contigs to individual genomes, and prediction of microbial genes and proteins. The bioinformatic analysis of the large 161 Gb sequence dataset generated 3,314,688 contigs and 120 microbial genomes. The raw metagenomic data and the detailed description of the bioinformatics pipeline applied in data analysis provide an important resource for the genomic characterization of microbial communities associated with biodegraded plastic fabric materials. The raw shotgun metagenomics sequence data of microbial communities on plastic fabric materials have been deposited in MG-RAST (https://www.mg-rast.org/) under accession numbers: mgm4794685.3–mgm4794690.3. The datasets and raw data presented here were associated with the main research work “Metagenomic characterization of microbial communities on plasticized fabric materials exposed to harsh tropical environments” (Radwan et al., 2020).

  • Research Article
  • Cite Count Icon 21
  • 10.3390/v10090479
Insights into the Human Virome Using CRISPR Spacers from Microbiomes.
  • Sep 7, 2018
  • Viruses
  • Claudio Hidalgo-Cantabrana + 2 more

Owing to advances in next-generation sequencing over the past decade, our understanding of the human microbiome and its relationship to health and disease has increased dramatically. Yet our insights into the human virome, and its interplay with important microbes that impact human health, are relatively limited. Prokaryotic and eukaryotic viruses are present throughout the human body, comprising a large and diverse population which influences several niches and impacts our health at various body sites. The presence of prokaryotic viruses such as phages has been documented at many different body sites, with the human gut being the richest ecological niche. Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and associated proteins constitute the adaptive immune system of bacteria, which prevents attack by invasive nucleic acid. CRISPR-Cas systems function by uptake and integration of foreign genetic element sequences into the CRISPR array, which constitutes a genomic archive of iterative vaccination events. Consequently, CRISPR spacers can be investigated to reconstruct interplay between viruses and bacteria, and metagenomic sequencing data can be exploited to provide insights into host-phage interactions within a niche. Here, we show how the CRISPR spacer content of commensal and pathogenic bacteria can be used to determine the evidence of their phage exposure. This framework opens new opportunities for investigating host-virus dynamics in metagenomic data, and highlights the need to dedicate more effort to virome sampling and sequencing.

  • Research Article
  • Cite Count Icon 7
  • 10.7717/peerj.13404
DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data
  • Jun 8, 2022
  • PeerJ
  • Yunmeng Chu + 4 more

Bacteriophages (phages) are the most abundant and diverse biological entity on Earth. Due to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate functions of phage genes efficiently by homology search on a large scale, especially for newly discovered phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three specific proteins of Caudovirales phage. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify the three specific proteins from metagenomic data. The framework takes one-hot encoding data of original protein sequences as the input and automatically extracts predictive features in the process of modeling. To overcome the false positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values demonstrates high performance in terms of Precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the results showed that, compared to the conventional alignment-based methods, our proposed framework had a particular advantage in identifying the novel phage-specific protein sequences of portal and TerL with remote homology to their counterparts in the training datasets. In summary, our study develops, for the first time, a CNN-based framework for identifying the phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.
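The input representation named in this abstract, one-hot encoding of protein sequences, can be sketched in a few lines of Python. This is only an illustration of the general technique under stated assumptions, not DeephageTP's actual preprocessing: the 20-letter alphabet order, the fixed `length`, the zero-padding, and the all-zero row for unknown residues are all hypothetical choices, as is the function name `one_hot_protein`.

```python
from typing import List

# Assumed alphabet: the 20 standard amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_protein(seq: str, length: int) -> List[List[int]]:
    """Encode a protein sequence as a length x 20 matrix of 0/1 values.

    Sequences longer than `length` are truncated; shorter ones are
    padded with all-zero rows. Residues outside the assumed alphabet
    (for example X) also become all-zero rows.
    """
    rows: List[List[int]] = []
    for aa in seq.upper()[:length]:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    while len(rows) < length:
        rows.append([0] * len(AMINO_ACIDS))
    return rows


matrix = one_hot_protein("ACX", length=5)
print(len(matrix), len(matrix[0]))  # rows and columns of the encoded matrix
```

A fixed-size matrix like this is the kind of input a convolutional network typically expects; how the published framework handles sequence length and rare residues should be checked in its repository.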

  • Front Matter
  • Cite Count Icon 2
  • 10.3389/fpls.2016.00433
Will Benchtop Sequencers Resolve the Sequencing Trade-off in Plant Genetics?
  • Apr 6, 2016
  • Frontiers in Plant Science
  • Alex D Twyford


  • Research Article
  • Cite Count Icon 66
  • 10.4056/sigs.651139
The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data
  • Mar 30, 2010
  • Standards in Genomic Sciences
  • David M Tanenbaum + 11 more

The JCVI metagenomics analysis pipeline provides for the efficient and consistent annotation of shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. The process can be equally applied to individual sequence reads from traditional Sanger capillary electrophoresis sequences, newer technologies such as 454 pyrosequencing, or sequence assemblies derived from one or more of these data types. It includes the analysis of both coding and non-coding genes, whether full-length or, as is often the case for shotgun metagenomics, fragmentary. The system is designed to provide the best-supported conservative functional annotation based on a combination of trusted homology-based scientific evidence and computational assertions and an annotation value hierarchy established through extensive manual curation. The functional annotation attributes assigned by this system include gene name, gene symbol, GO terms, EC numbers, and JCVI functional role categories.

More from: Ecology and evolution
  • New
  • Research Article
  • 10.1002/ece3.72314
Inferring Fine‐Scale Mutation and Recombination Rate Maps in Aye‐Ayes (Daubentonia madagascariensis)
  • Nov 3, 2025
  • Ecology and Evolution
  • Vivak Soni + 4 more

  • New
  • Research Article
  • 10.1002/ece3.72448
Projected Expansion and Northwestern Shift of Wikstroemia indica Suitable Habitats in China Under Multiple Climate Change Scenarios: An Optimized MaxEnt Approach
  • Nov 1, 2025
  • Ecology and Evolution
  • Yangzhou Xiang + 7 more

  • New
  • Research Article
  • 10.1002/ece3.72421
Will the Establishment of a National Park Protect More Suitable Habitats for the Qinling Golden Snub‐Nosed Monkey?
  • Nov 1, 2025
  • Ecology and Evolution
  • Tong Wu + 14 more

  • New
  • Research Article
  • 10.1002/ece3.72357
Copy‐Paste Augmentation Improves Automatic Species Identification in Camera Trap Images
  • Nov 1, 2025
  • Ecology and Evolution
  • Cédric S Mesnage + 6 more

  • New
  • Research Article
  • 10.1002/ece3.72404
Distribution Pattern of Ants in Huanglianshan National Nature Reserve From Yunnan, China
  • Nov 1, 2025
  • Ecology and Evolution
  • Xingze Li + 4 more

  • New
  • Research Article
  • 10.1002/ece3.72410
Predictive Distribution Modeling of the Medicinal Leech Hirudo verbana Carena, 1820 (Hirudinea, Hirudinidae) in Sicily: Implications for Conservation
  • Nov 1, 2025
  • Ecology and Evolution
  • Mirko Liuzzo + 2 more

  • New
  • Research Article
  • 10.1002/ece3.72416
Trophic Relationships of Aquatic Species Offer Valuable Insights Into Shallow Lake Ecosystem Recovery
  • Nov 1, 2025
  • Ecology and Evolution
  • Yajun Qiao + 9 more

  • New
  • Research Article
  • 10.1002/ece3.72403
Spatial Phylogenetics Reveals Endemism Hotspots and Conservation Priorities in Chinese Asteraceae
  • Nov 1, 2025
  • Ecology and Evolution
  • Xinyi Zheng + 7 more

  • New
  • Research Article
  • 10.1002/ece3.72457
Projected Spatial–Temporal Habitat Patterns of the Lady Amherst's Pheasant (Chrysolophus amherstiae) Under Climate and Land Use Change
  • Nov 1, 2025
  • Ecology and Evolution
  • Xue Sun + 4 more

  • New
  • Research Article
  • 10.1002/ece3.72429
Identification of Ecological Corridors for Semi‐Aquatic Vertebrates: A Case of the Eurasian Otter in Northeast China
  • Nov 1, 2025
  • Ecology and Evolution
  • Qingyi Wang + 5 more
