Global biogeography of N2-fixing microbes: nifH amplicon database and analytics workflow
Abstract. Marine dinitrogen (N2) fixation is a globally significant biogeochemical process carried out by a specialized group of prokaryotes (diazotrophs), yet our understanding of their ecology is constantly evolving. Although marine N2 fixation is often ascribed to cyanobacterial diazotrophs, indirect evidence suggests that non-cyanobacterial diazotrophs (NCDs) might also be important. One widely used approach for understanding diazotroph diversity and biogeography is polymerase chain reaction (PCR) amplification of a portion of the nifH gene, which encodes a structural component of the N2-fixing enzyme complex, nitrogenase. An array of bioinformatic tools exists to process nifH amplicon data; however, the lack of standardized practices has hindered cross-study comparisons. This has led to a missed opportunity to more thoroughly assess diazotroph diversity and biogeography, as well as their potential contributions to the marine N cycle. To address these knowledge gaps, a bioinformatic workflow was designed that standardizes the processing of nifH amplicon datasets originating from high-throughput sequencing (HTS). Multiple datasets are efficiently and consistently processed with a specialized DADA2 pipeline to identify amplicon sequence variants (ASVs). A series of customizable post-pipeline stages then detect and discard spurious nifH sequences and annotate the subsequent quality-filtered nifH ASVs using multiple reference databases and classification approaches. This newly developed workflow was used to reprocess nearly all publicly available nifH amplicon HTS datasets from marine studies and to generate a comprehensive nifH ASV database containing 9383 ASVs aggregated from 21 studies that represent the diazotrophic populations in the global ocean. For each sample, the database includes physical and chemical metadata obtained from the Simons Collaborative Marine Atlas Project (CMAP). 
Here we demonstrate the utility of this database for revealing global biogeographical patterns of prominent diazotroph groups and highlight the influence of sea surface temperature. The workflow and nifH ASV database provide a robust framework for studying marine N2 fixation and diazotrophic diversity captured by nifH amplicon HTS. Future datasets that target understudied ocean regions can be added easily, and users can tune parameters and studies included for their specific focus. The workflow and database are available, respectively, on GitHub (https://github.com/jdmagasin/nifH-ASV-workflow, last access: 21 January 2025; Morando et al., 2024c) and Figshare (https://doi.org/10.6084/m9.figshare.23795943.v2; Morando et al., 2024b).
- Research Article
2
- 10.1093/ismeco/ycaf038
- Feb 27, 2025
- ISME Communications
Exploring the diversity of diazotrophs is key for understanding their role in supplying fixed nitrogen that supports marine productivity. A nested PCR assay using the universal primer set nifH1-nifH4, which targets the nitrogenase (nifH) gene, is a widely used approach for studying marine diazotrophs by amplicon sequencing. Metagenomics, direct sequencing of DNA without PCR, has provided complementary views of the diversity of marine diazotrophs. A significant fraction of the metagenome-derived nifH sequences (e.g., Planctomycete- and Proteobacteria-affiliated) were reported to have nucleotide mismatches with the nifH1-nifH4 primers, leading to the suggestion that nifH amplicon sequencing does not detect specific diazotrophic taxa and underrepresents diazotroph diversity. Here, we report that these mismatches are mostly located in a single-base at the 5′-end of the nifH4 primer, which does not impact detection of the nifH genes. This is demonstrated by the presence of the nifH genes that contain the nucleotide mismatches in a recent compilation of global ocean nifH amplicon datasets, with high relative abundances detected in a variety of samples. While the metagenome- and metatranscriptome-derived nifH genes accounted for 4.4% of the total amplicon sequence variants (ASVs) from the global ocean nifH amplicon database, the corresponding ASVs can have high relative abundances (accounting for 47% of the reads in the database). These analyses underscore that nifH amplicon sequencing using the nifH1-nifH4 primers is an important tool for studying diversity of marine diazotrophs, particularly as a complement to metagenomics which can provide taxonomic and metabolic information for some dominant groups.
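The positional argument above can be illustrated with a minimal sketch that reports where a degenerate primer mismatches a template binding site, counted from the primer's 5′ end. The primer and binding sites below are hypothetical toy sequences (they are not nifH1-nifH4), and the `window=5` rule for "close to the 3′ end" is an illustrative assumption rather than a threshold taken from the study.

```python
# IUPAC nucleotide codes mapped to the bases each code covers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_positions(primer, site):
    """Return 1-based positions (from the primer's 5' end) where the
    template base is not covered by the primer's degenerate code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    return [
        i + 1
        for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if s not in IUPAC[p]
    ]


def three_prime_clean(primer, site, window=5):
    """True if no mismatch falls within the last `window` primer bases."""
    cutoff = len(primer) - window
    return all(pos <= cutoff for pos in mismatch_positions(primer, site))


# Hypothetical toy primer and binding sites, all written 5'->3' on the
# same strand so that positions line up directly.
toy_primer = "ACYGGNTTRGA"
site_5p_mismatch = "GCTGGATTAGA"  # differs only at the 5'-most base
site_3p_mismatch = "ACTGGATTAGC"  # differs only at the 3'-most base

print(mismatch_positions(toy_primer, site_5p_mismatch))
print(three_prime_clean(toy_primer, site_5p_mismatch))
print(three_prime_clean(toy_primer, site_3p_mismatch))
```

A check of this kind only locates mismatches. Whether a given mismatch prevents amplification is an empirical question, which the study addresses by looking for the affected sequences in amplicon datasets.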
- Research Article
1
- 10.1111/1755-0998.70023
- Aug 4, 2025
- Molecular Ecology Resources
Assessing and monitoring genetic diversity is vital for understanding the ecology and evolution of natural populations but is often challenging in animal and plant species due to technically and physically demanding tissue sampling. Although environmental DNA (eDNA) metabarcoding is a promising alternative to the traditional population genetic monitoring based on biological samples, its practical application remains challenging due to spurious sequences present in the amplicon data, even after data processing with the existing sequence filtering and denoising (error correction) methods. Here we developed a novel amplicon filtering approach that can effectively eliminate such spurious amplicon sequence variants (ASVs) in eDNA metabarcoding data. A simple simulation of eDNA metabarcoding processes was performed to understand the patterns of read count (abundance) distributions of true ASVs and their polymerase chain reaction (PCR)-generated artefacts (i.e., false-positive ASVs). Based on the simulation results, the approach was developed to estimate the abundance distributions of true and false-positive ASVs using Gaussian mixture models and to determine a statistically based threshold between them. The developed approach was implemented as an R package, gmmDenoise, and evaluated using single-species metabarcoding datasets in which all or some true ASVs (i.e., haplotypes) were known. Example analyses using community (multi-species) metabarcoding datasets were also performed to demonstrate how gmmDenoise can be used to derive reliable intraspecific diversity estimates and population genetic inferences from noisy amplicon sequencing data. The gmmDenoise package is freely available in the GitHub repository (https://github.com/YSKoseki/gmmDenoise).
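The mixture-model idea can be sketched in a few lines: fit two Gaussians to log-transformed read counts and keep the ASVs that are more likely to belong to the high-abundance component. This is a standard-library illustration only. It does not use or reproduce the gmmDenoise API, the posterior-based cutoff is an assumption that may differ from the package's threshold rule, and the read counts are invented toy values.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and std. dev. sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by expectation-maximisation.

    Returns (weights, means, sds); component 0 is initialised from the
    lower half of the sorted data and component 1 from the upper half.
    """
    xs = sorted(values)
    half = len(xs) // 2
    means, sds = [], []
    for group in (xs[:half], xs[half:]):
        m = sum(group) / len(group)
        means.append(m)
        sds.append(max(1e-3, math.sqrt(sum((v - m) ** 2 for v in group) / len(group))))
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(1e-3, math.sqrt(var))
    return weights, means, sds


def keep_true_asvs(read_counts):
    """Keep counts that are more probable under the high-abundance component."""
    logs = [math.log10(c) for c in read_counts]
    weights, means, sds = fit_two_gaussians(logs)
    kept = []
    for count, x in zip(read_counts, logs):
        low, high = (w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
        if high > low:
            kept.append(count)
    return kept


# Toy read counts: many low-abundance artefacts and a few abundant haplotypes.
toy_counts = [2, 3, 2, 4, 3, 5, 2, 3, 4, 3, 800, 1200, 950, 1500, 1100]
print(keep_true_asvs(toy_counts))
```

Working on the log scale keeps the two abundance classes roughly symmetric, which is what makes a Gaussian description plausible. Real data may need more than two components.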
- Preprint Article
1
- 10.22541/au.174313088.80834592/v1
- Mar 28, 2025
Assessing and monitoring genetic diversity is vital for understanding the ecology and evolution of natural populations but is often challenging in animal and plant species due to technically and physically demanding tissue sampling. Although environmental DNA (eDNA) metabarcoding is a promising alternative to the traditional population genetic monitoring based on biological samples, its practical application remains challenging due to spurious sequences present in the amplicon data, even after data processing with the existing sequence filtering and denoising (error correction) methods. Here we developed a novel amplicon filtering approach that can effectively eliminate such spurious amplicon sequence variants (ASVs) in eDNA metabarcoding data. A simple simulation of eDNA metabarcoding processes was performed to understand the patterns of read count (abundance) distributions of true ASVs and their polymerase chain reaction (PCR)-generated artifacts (i.e., false-positive ASVs). Based on the simulation results, the approach was developed to estimate the abundance distributions of true and false-positive ASVs using Gaussian mixture models and to determine a statistically based threshold between them. The developed approach was implemented as an R package, gmmDenoise, and evaluated using single-species eDNA metabarcoding datasets in which all or some true ASVs (i.e., haplotypes) were known. Example analyses using community (multi-species) eDNA datasets were also performed to demonstrate how gmmDenoise can be used to derive reliable intraspecific diversity estimates and population genetic inferences from noisy amplicon sequencing data. The gmmDenoise package is freely available in the GitHub repository (https://github.com/YSKoseki/gmmDenoise).
- Research Article
108
- 10.1128/aem.01512-17
- Jan 31, 2018
- Applied and Environmental Microbiology
The dinitrogenase reductase gene (nifH) is the most widely established molecular marker for the study of nitrogen-fixing prokaryotes in nature. A large number of PCR primer sets have been developed for nifH amplification, and the effective deployment of these approaches should be guided by a rapid, easy-to-use analysis protocol. Bioinformatic analysis of marker gene sequences also requires considerable expertise. In this study, we advance the state of the art for nifH analysis by evaluating nifH primer set performance, developing an improved amplicon sequencing workflow, and implementing a user-friendly bioinformatics pipeline. The developed amplicon sequencing workflow is a three-stage PCR-based approach that uses established technologies for incorporating sample-specific barcode sequences and sequencing adapters. Based on our primer evaluation, we recommend the Ando primer set be used with a modified annealing temperature of 58°C, as this approach captured the largest diversity of nifH templates, including paralog cluster IV/V sequences. To improve nifH sequence analysis, we developed a computational pipeline which infers taxonomy and optionally filters out paralog sequences. In addition, we employed an empirical model to derive optimal operational taxonomic unit (OTU) cutoffs for the nifH gene at the species, genus, and family levels. A comprehensive workflow script named TaxADivA (TAXonomy Assignment and DIVersity Assessment) is provided to ease processing and analysis of nifH amplicons. Our approach is then validated through characterization of diazotroph communities across environmental gradients in beach sands impacted by the Deepwater Horizon oil spill in the Gulf of Mexico, in a peat moss-dominated wetland, and in various plant compartments of a sugarcane field. IMPORTANCE: Nitrogen availability often limits ecosystem productivity, and nitrogen fixation, exclusive to prokaryotes, comprises a major source of nitrogen input that sustains food webs.
The nifH gene, which codes for the iron protein of the nitrogenase enzyme, is the most widely established molecular marker for the study of nitrogen-fixing microorganisms (diazotrophs) in nature. In this study, a flexible sequencing/analysis pipeline, named TaxADivA, was developed for nifH amplicons produced by Illumina paired-end sequencing, and it enables an inference of taxonomy, performs clustering, and produces output in formats that may be used by programs that facilitate data exploration and analysis. Diazotroph diversity and community composition are linked to ecosystem functioning, and our results advance the phylogenetic characterization of diazotroph communities by providing empirically derived nifH similarity cutoffs for species, genus, and family levels. The utility of our pipeline is validated for diazotroph communities in a variety of ecosystems, including contaminated beach sands, peatland ecosystems, living plant tissues, and rhizosphere soil.
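As a rough illustration of how a similarity cutoff turns sequences into OTUs, the sketch below applies greedy centroid clustering to pre-aligned, equal-length toy sequences. The clustering strategy, the toy sequences, and the cutoff values are all assumptions chosen for illustration. They are not the empirically derived nifH cutoffs or the clustering method used by TaxADivA.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, cutoff):
    """Assign each sequence to the first centroid it matches at >= cutoff.

    `seqs` should be ordered by decreasing abundance so that abundant
    sequences become centroids. Returns one OTU index per input sequence.
    """
    centroids = []
    assignment = []
    for seq in seqs:
        for idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= cutoff:
                assignment.append(idx)
                break
        else:
            # No centroid is similar enough, so this sequence founds a new OTU.
            centroids.append(seq)
            assignment.append(len(centroids) - 1)
    return assignment


# Toy aligned sequences (20 nt): s2 differs from s1 at 1 site, s3 at 4 sites.
s1 = "ACGTACGTACGTACGTACGT"
s2 = "ACGTACGTACGTACGTACGA"
s3 = "TCGTTCGTTCGTTCGTACGT"

# Hypothetical cutoffs, from strict to permissive.
for cutoff in (0.97, 0.90, 0.75):
    print(cutoff, greedy_otus([s1, s2, s3], cutoff))
```

Lowering the cutoff merges progressively more distant sequences, which is why rank-specific cutoffs need to be calibrated against reference taxonomy for each marker gene.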
- Research Article
45
- 10.1016/j.ecoleng.2017.02.010
- Feb 11, 2017
- Ecological Engineering
Long-term aromatic rice cultivation effect on frequency and diversity of diazotrophs in its rhizosphere
- Research Article
5
- 10.1016/j.scitotenv.2025.178727
- Feb 1, 2025
- The Science of the total environment
Aiming to gain a general picture of rbcL diversity within freshwater diatom species, this study assembles and analyzes multiple metabarcoding datasets spanning various geographical regions. From these datasets, we inferred >10,000 amplicon sequence variants (ASVs) of 263-bp length. More than half of the 1000 most abundant ASVs were recorded in both Eurasia and N America and there was only limited evidence for continent-specific lineages. The geographical range was extended for some species, illustrating the potential of metabarcoding datasets for such checks. For detailed analysis of intraspecific diversity, 73 freshwater species were selected, corresponding to 360 ASVs assigned phylogenetically. We found notable variation, some species being represented by only one or a few ASVs, while others were represented by a higher number. Furthermore, within species, ASVs exhibited different dominance and distribution patterns, in some cases with a head-tail pattern, in others a more equal spread of abundance or unresolved reticulate relationships. Except for Ulnaria ulna, no geographical structure among species' ASVs was detectable in haplotype networks using the 263-bp rbcL marker. Observed heterogeneity within species was categorized by computing several metrics of genetic variation and classified into three groups, reflecting optimal sampling strategies based on the patterns of intraspecific variation in the 73 target species. There was a significant relationship between intraspecific diversity and the traditional separation between 'centric' and 'pennate' diatoms, with centric species exhibiting significantly fewer variants than pennates, possibly because of different plastid inheritance patterns.
- Research Article
8
- 10.3389/fmars.2023.1243713
- Jan 31, 2024
- Frontiers in Marine Science
Foraminifera are adapted to a wide range of environments, and environmental DNA (eDNA) metabarcoding of foraminifera should facilitate development of new environmental indicators. In this study, we used eDNA metabarcoding to evaluate the discrepancy between planktic and benthic foraminifera molecular communities identified in bottom water and short sediment cores. The molecular community was compared to foraminiferal shells in sediment traps set on the seafloor. Samples were collected in June and August around the Takuyo-Daigo Seamount in the western subtropical Pacific Ocean. Approximately 40% of amplicon sequence variants (ASVs) pertained to unknown foraminiferal lineages in sediment samples, compared with only 22% in bottom water. Bottom water contained benthic foraminifera and taxonomically unassigned lineages, which were attributed to resuspended particles. In bottom water, 100 ASVs were assigned to planktic foraminifera. ASVs assigned to Candeina nitida were most abundant and accounted for 36%–86% of planktic foraminiferal ASVs. In sedimentary DNA, Globigerinita glutinata was the most abundant among 33 ASVs of planktic foraminifera. However, transparent shells in sediment traps contained more spinose species, such as Globigerinoides ruber, whereas C. nitida was not found and few G. glutinata were detected. This discrepancy between the three samples may be due to the species-specific preservation, to polymerase chain reaction biases, and/or to low abundance of planktic foraminifers. In sedimentary DNA, 893 ASVs were assigned to high-level foraminiferal taxa. Among benthic foraminiferal lineages, monothalamids were most abundant, as reported in other deep-sea regions. Molecular communities formed one cluster above the boundary at which ASVs sharply decrease across the three cores. 
Our results suggest that depth within the sediment core can affect foraminiferal ASVs, but the distance between sites up to 200 m did not strongly affect ASVs of sedimentary DNA at least above the boundary at which ASVs sharply decrease. Sequences of foraminiferal DNA in sediment decreased linearly in core PC02-A1, but exponentially in core PC03-B3. The decline of foraminiferal ASVs may reflect both the decreases in numbers of living foraminifera and degradation of DNA in sediment, related to the particle mixing depth.
- Research Article
- 10.3897/aca.4.e64859
- Mar 4, 2021
- ARPHA Conference Abstracts
We applied DNA metabarcoding to evaluate the ecology of genetic variants within several diatom species that are important for biomonitoring. Benthic diatoms are widely used as bioindicators for biomonitoring programmes, including those for European rivers demanded by Water Framework Directive (WFD). Morphological identification of diatoms at species level is required for assessing the ecological status in biomonitoring programmes. However, this is a time-consuming task and requires expert knowledge. In addition, closely related species, which sometimes are scarcely distinguishable on the basis of their morphology, can show different ecological preferences; these may even vary within a single diatom species. Not being able to identify the different ecological preferences shown by the genetic variants of a single species or closely related species, might have consequences for biomonitoring programmes, especially if such differences occur within common species. The key diatom species that we studied were: Fistulifera saprophila (FSAP), widely regarded as a marker for elevated nutrient levels, organic pollution and hence poor ecological status; Achnanthidium minutissimum (ADMI), which usually indicates good ecological status; and Nitzschia inconspicua (NINC) and N. soratensis (NSTS), two species that are widely separated phylogenetically but almost impossible to distinguish in the light microscope. Our dataset was based on high-throughput sequencing using a 312-bp rbcL marker. We used the denoising pipeline DADA2 to infer amplicon sequence variants (ASVs) from 554 environmental samples from river biomonitoring campaigns in Catalonia (NE Spain) and France. Ecological groupings of ASVs were distinguished according to their environmental responses given by Threshold Indicator Taxa ANalysis (TITAN); the environmental parameters that most influenced the occurrence of these groupings were tested using boosted regression trees. 
We could distinguish three different ecological groupings of ASVs within ADMI and three within FSAP. In each species two of the groupings were clearly separated by their opposite responses to calcium and conductivity and boosted regression trees showed that for three out of four of these groupings, these two variables were among the most important variables for explaining the ASV distributions. The third grouping in FSAP had a negative response to total organic carbon and a positive response to altitude and hence was better represented in less organically polluted waters and higher ecological status than is generally assumed for FSAP. Our analyses did not identify ecological groupings of ASVs within NINC and NSTS but confirmed earlier studies, based on more limited sampling, that indicated different preferences of these species. Conductivity and calcium were the variables that most influenced the occurrence of NINC and NSTS, NINC being better distributed in waters with higher levels of calcium and conductivity than NSTS. Our findings indicate the potential use of DNA metabarcoding for distinguishing the ecological preferences of genetic variants within a single species or closely related species. This information, coupled with the broad knowledge generated over many years using traditional microscope-based identifications, will facilitate the development of more accurate biological indexes for the biomonitoring programmes of the future.
- Research Article
15
- 10.1016/j.hal.2024.102568
- Jan 6, 2024
- Harmful Algae
The high molecular diversity in Noctiluca scintillans is dominated by intra-genomic variations revealed by single cell high-throughput sequencing of 18S rDNA V4
- Research Article
- 10.1016/j.ijpara.2025.09.004
- Sep 1, 2025
- International journal for parasitology
Ecological drivers of parasite genetic diversity: evidence for dilution effects in a single strongylid species infecting sympatric Bornean primates.
- Research Article
27
- 10.5194/essd-13-4913-2021
- Oct 26, 2021
- Earth System Science Data
Abstract. Arctic marine protist communities have been understudied due to challenging sampling conditions, in particular during winter and in deep waters. The aim of this study was to improve our knowledge on Arctic protist diversity through the year, in both the epipelagic (< 200 m depth) and mesopelagic zones (200–1000 m depth). Sampling campaigns were performed in 2014, during five different months, to capture the various phases of the Arctic primary production: January (winter), March (pre-bloom), May (spring bloom), August (post-bloom), and November (early winter). The cruises were undertaken west and north of the Svalbard archipelago, where warmer Atlantic waters from the West Spitsbergen Current meet cold Arctic waters from the Arctic Ocean. From each cruise, station, and depth, 50 L of seawater was collected, and the plankton was size-fractionated by serial filtration into four size fractions between 0.45–200 µm, representing picoplankton (0.45–3 µm), small and large nanoplankton (3–10 and 10–50 µm, respectively), and microplankton (50–200 µm). In addition, vertical net hauls were taken from 50 m depth to the surface at selected stations. The net hauls were fractionated into the large nanoplankton (10–50 µm) and microplankton (50–200 µm) fractions. From the plankton samples DNA was extracted, the V4 region of the 18S rRNA-gene was amplified by polymerase chain reaction (PCR) with universal eukaryote primers, and the amplicons were sequenced by Illumina high-throughput sequencing. Sequences were clustered into amplicon sequence variants (ASVs), representing protist genotypes, with the dada2 pipeline. Taxonomic classification was made against the curated Protist Ribosomal Reference database (PR2). Altogether, 6536 protist ASVs were obtained (including 54 fungal ASVs). Both ASV richness and taxonomic composition varied between size fractions, seasons, and depths. 
ASV richness was generally higher in the smaller fractions and higher in winter and the mesopelagic samples than in samples from the well-lit epipelagic zone during summer. During spring and summer, the phytoplankton groups diatoms, chlorophytes, and haptophytes dominated in terms of relative read abundance in the epipelagic zone. Parasitic and heterotrophic groups such as Syndiniales and certain dinoflagellates dominated in the mesopelagic zone all year, as well as in the epipelagic zone during the winter. The dataset is available at https://doi.org/10.17882/79823 (Egge et al., 2014).
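The two summary measures used above, ASV richness and relative read abundance per taxonomic group, can be computed from a per-sample count table as in the following minimal sketch. The ASV identifiers, counts, and group assignments are invented toy values, and the function names are hypothetical.

```python
def asv_richness(counts):
    """Number of ASVs observed (read count > 0) in one sample."""
    return sum(1 for c in counts.values() if c > 0)


def relative_read_abundance(counts, group_of):
    """Share of a sample's reads assigned to each taxonomic group."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    group_reads = {}
    for asv, c in counts.items():
        group = group_of[asv]
        group_reads[group] = group_reads.get(group, 0) + c
    return {group: reads / total for group, reads in group_reads.items()}


# Toy sample: read counts per ASV and a toy ASV-to-group lookup.
toy_sample = {"asv1": 60, "asv2": 20, "asv3": 20, "asv4": 0}
toy_groups = {
    "asv1": "diatoms",
    "asv2": "diatoms",
    "asv3": "Syndiniales",
    "asv4": "haptophytes",
}

print(asv_richness(toy_sample))
print(relative_read_abundance(toy_sample, toy_groups))
```

Relative read abundance describes the composition of the sequenced amplicons, so comparisons between samples are usually made after accounting for differences in sequencing depth.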
- Research Article
23
- 10.1002/cpz1.930
- Nov 1, 2023
- Current Protocols
Analysis of bacterial communities with 16S rRNA gene sequencing technologies requires comparing the reads to a reference database. This challenging annotation task relies on the currently available tools and 16S rRNA databases: SILVA, Greengenes and RDP. A successful annotation depends on the quality of the database. For instance, Greengenes and RDP have not been updated since 2013 and 2016, respectively. In addition, 16S sequencing technologies produce short reads that focus mainly on the V3-V4 hypervariable region, which hinders species assignment, in contrast to whole shotgun metagenome sequencing. Here, we combine the results of three standard protocols for 16S rRNA amplicon annotation that utilize homology-based methods, and we propose a new re-annotation strategy to enlarge the percentage of amplicon sequence variants (ASVs) classified up to the species level. Starting from the reference method, the DADA2 pipeline with SILVA v.138.1 reference database classification (Basic Protocol 1), our method maps the ASV sequences with a custom nucleotide BLAST against SILVA v.138.1 (Basic Protocol 2) and against the 16S database of Bacteria and Archaea of the NCBI RefSeq Targeted Loci Project (Basic Protocol 3). This new re-annotation workflow was tested on 16S rRNA amplicon data from 156 human fecal samples. The proposed strategy increased the proportion of ASVs classified at the species level nearly eightfold relative to the reference method for the database used in the present research. Basic Protocol 1: Sample inference and taxonomic profiling through DADA2 algorithm. Basic Protocol 2: Custom BLASTN database creation and ASV taxonomical assignment. Basic Protocol 3: ASV taxonomical assignment using NCBI RefSeq Targeted Loci Project database. Basic Protocol 4: Definitive selection of lineages among the three methods.
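One simple way to picture the final selection step is to take, for each ASV, the annotation that resolves the most ranks, with ties going to the method listed first. This rule is an assumption made for illustration. The abstract does not state the actual selection criteria of Basic Protocol 4, and the method labels and lineages below are hypothetical.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_depth(lineage):
    """Number of consecutive ranks, starting at kingdom, that are assigned."""
    depth = 0
    for rank in RANKS:
        if not lineage.get(rank):
            break
        depth += 1
    return depth


def select_lineage(candidates):
    """Pick the deepest lineage from (method, lineage) pairs.

    Candidates are given in priority order; on a tie the earlier
    method wins because max() returns the first maximal element.
    """
    return max(candidates, key=lambda item: lineage_depth(item[1]))


genus_level = {
    "kingdom": "Bacteria", "phylum": "Bacteroidota", "class": "Bacteroidia",
    "order": "Bacteroidales", "family": "Bacteroidaceae", "genus": "Bacteroides",
}
species_level = dict(genus_level, species="Bacteroides fragilis")

# Hypothetical annotations of one ASV from three methods.
toy_candidates = [
    ("dada2_silva", genus_level),
    ("blast_silva", species_level),
    ("blast_refseq", genus_level),
]

method, lineage = select_lineage(toy_candidates)
print(method, lineage.get("species"))
```

A production workflow would also need to handle conflicting assignments at the same rank, which a depth-only rule ignores.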
- Research Article
- 10.31016/1998-8435-2024-18-1-58-65
- Mar 7, 2024
- Russian Journal of Parasitology
The purpose of the research is isolation, identification, and analysis of ASV (Amplicon Sequence Variant) types of Cryptosporidium spp. in pigs in the Vologda Region of the Russian Federation. Materials and methods. The research has been conducted in the Russian Federation for the first time. The research was conducted on pig farms in the Vologda Region of the Northwestern Federal District of the Russian Federation from January to October 2023. Feces were taken from piglets of various age groups, as well as milking sows. The samples were studied using the equipment of the resource center “Genomic Technologies, Proteomics and Cell Biology” of ARRIAM. Species of the genus Cryptosporidium were identified in fecal samples using high-throughput sequencing of 18S rRNA gene fragment amplicon libraries obtained from nested PCR, followed by “denoising”, sequence combining, and restoring the original phylotypes (ASVs, Amplicon Sequence Variants). Results and discussion. Cryptosporidium spp. were identified in each age group studied. High-throughput sequencing of the libraries using the Illumina technology yielded 20 to 100 thousand nucleotide sequences (reads) per sample; after processing, a total of 2,372 ASVs were identified. The analysis of ASV taxonomic affiliation, performed with phylogenetic analysis supplemented by a blastn search of the GenBank database, showed that across all studied samples only 10 ASVs had high similarity to sequences deposited in GenBank as 18S rRNA gene fragments of Cryptosporidium scrofarum. Eight ASV types were unique and did not repeat from farm to farm. Probably, these sequences belong to local populations of C. scrofarum subspecies. Of interest is the discovery of a unique Cryptosporidium sequence of ASV8 type which is only 91.47% similar to the closest relative of the genus, which may indicate a rather distant taxonomic relationship.
This type of nucleotide sequence can be further described as a new species. All identified unique ASV nucleotide sequences were deposited in GenBank.
- Research Article
73
- 10.1093/femsre/fuac046
- Nov 23, 2022
- FEMS Microbiology Reviews
Non-cyanobacterial diazotrophs: Global diversity, distribution, ecophysiology, and activity in marine waters.
- Research Article
3208
- 10.1038/ismej.2017.119
- Jul 21, 2017
- The ISME Journal
Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently such that amplicon sequence variants (ASVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer resolution are immediately apparent, and arguments for ASV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits that derive from the status of ASVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how these features grant ASVs the combined advantages of closed-reference OTUs—including computational costs that scale linearly with study size, simple merging between independently processed data sets, and forward prediction—and of de novo OTUs—including accurate measurement of diversity and applicability to communities lacking deep coverage in reference databases. We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.
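The "simple merging" property follows from using the sequence itself as the label: tables produced by independent runs can be combined by exact sequence match, with no re-clustering. The sketch below shows this with nested dictionaries. The function name, sample names, and short toy sequences are hypothetical.

```python
def merge_asv_tables(*tables):
    """Merge ASV count tables keyed by exact sequence.

    Each table maps sequence -> {sample: read count}. Identical
    sequences from different studies end up in the same row.
    """
    merged = {}
    for table in tables:
        for seq, samples in table.items():
            row = merged.setdefault(seq, {})
            for sample, count in samples.items():
                row[sample] = row.get(sample, 0) + count
    return merged


# Two independently processed toy studies that share one ASV ("ACGT").
study_a = {"ACGT": {"A1": 10}, "TTGA": {"A1": 3}}
study_b = {"ACGT": {"B1": 7}, "GGCC": {"B1": 2}}

merged = merge_asv_tables(study_a, study_b)
for seq, row in merged.items():
    print(seq, row)
```

De novo OTU identifiers, by contrast, are defined relative to the other reads in the same clustering run, so two such tables cannot be joined this way without re-clustering the pooled data.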