DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics

Jérémy Gauthier, Tomasz Suchan, Pierre Peterlongo, Chloé Riou, Charlotte Mouden, Claire Lemaitre, Nadir Alvarez, Nils Arrigo

doi:10.7717/peerj.9291

Abstract

Restriction site Associated DNA Sequencing (RAD-Seq) is a technique that sequences specific loci along the genome. It is widely employed in evolutionary biology because it makes it possible to exploit variant information (mainly Single Nucleotide Polymorphisms, SNPs) from entire populations at a reduced cost. Common RAD-dedicated tools, such as STACKS or IPyRAD, are based on all-vs-all read alignments, which require substantial time and computing resources. We present an original method, DiscoSnp-RAD, that avoids this pitfall: variants are detected by exploiting specific parts of the assembly graph built from the reads, so no all-vs-all read alignment is needed. We tested the implementation on simulated datasets of increasing size, up to 1,000 samples, and on real RAD-Seq data from 259 specimens of Chiastocheta flies, morphologically assigned to seven species. All individuals were successfully assigned to their species using both STRUCTURE and Maximum Likelihood phylogenetic reconstruction. Moreover, the identified variants revealed a within-species genetic structure linked to geographic distribution. Furthermore, our results show that DiscoSnp-RAD is significantly faster than state-of-the-art tools. Overall, the results show that DiscoSnp-RAD is suitable for identifying variants from RAD-Seq data, that it does not require time-consuming parameterization steps, and that it stands out from other tools due to its completely different principle, which makes it substantially faster, in particular on large datasets.

Highlights

Next-generation sequencing and the ability to obtain genomic sequences for hundreds to thousands of individuals of the same species have opened new horizons in population genomics research.
After validation tests on simulated datasets of increasing size, we present an application of the DiscoSnp-RAD implementation to double-digest Restriction-site Associated DNA sequencing (RAD-Seq) data from a genus-wide sampling of parasitic flies belonging to the genus Chiastocheta.
We first recall the fundamentals of the DiscoSnp++ algorithm, which is based on the analysis of the de Bruijn graph (Pevzner, Tang & Tesler, 2004). This is a directed graph whose vertices are the words of length k (k-mers) contained in the reads, with an oriented edge from a k-mer s to a k-mer t if they overlap perfectly on k − 1 nucleotides, that is, if the suffix of length k − 1 of s equals the prefix of length k − 1 of t.
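
The de Bruijn graph definition above can be illustrated with a minimal Python sketch. It is not the DiscoSnp++ implementation: the function names `kmers` and `build_de_bruijn` and the toy reads are assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors and k-mer abundance, which a practical tool would have to handle.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all words of length k contained in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Build a node-centric de Bruijn graph (hypothetical helper).

    Vertices are the distinct k-mers found in the reads. There is an
    edge s -> t when the suffix of length k-1 of s equals the prefix
    of length k-1 of t.
    """
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))

    # Index k-mers by their (k-1)-prefix so successors are found directly.
    by_prefix = defaultdict(set)
    for t in nodes:
        by_prefix[t[:k - 1]].add(t)

    # Map each k-mer to the sorted list of its successors.
    return {s: sorted(by_prefix.get(s[1:], ())) for s in nodes}


# Two toy reads that differ by a single nucleotide (G vs T at position 4).
reads = ["CATGACG", "CATTACG"]
graph = build_de_bruijn(reads, k=3)

for kmer in sorted(graph):
    print(kmer, "->", graph[kmer])
```

In this toy example the k-mer `CAT` has two successors (`ATG` and `ATT`), and the two resulting paths merge again at `ACG`. A single-nucleotide difference between reads therefore appears as a pair of parallel paths in the graph, the kind of local structure a graph-based method can look for instead of aligning all reads against each other.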

Summary

Introduction

Next-generation sequencing and the ability to obtain genomic sequences for hundreds to thousands of individuals of the same species have opened new horizons in population genomics research. This has been made possible by the development of cost-efficient approaches to obtain sufficient homologous genomic regions, through reproducible genome complexity reduction and the multiplexing of several samples within a single sequencing run (Andrews et al., 2016). This approach encompasses various methods with different intermediate steps to optimize the genome sampling, for example, ddRAD (Peterson et al., 2012), GBS (Elshire et al., 2011), 2b-RAD (Wang et al., 2012) and 3RAD/RADcap (Hoffberg et al., 2016). Several methods have been developed to build homologous genomic loci de novo and extract informative variations; STACKS (Catchen et al., 2013), PyRAD (Eaton, 2014) and its rewritten version IPyRAD (Eaton & Overcast, 2020) are the most commonly used in the population genomics community.

Methods

Results

Discussion

Conclusion