Abstract
The RADseq technology allows researchers to efficiently develop thousands of polymorphic loci across multiple individuals with little or no prior information on the genome. However, many questions remain about the biases inherent to this technology. Notably, sequence misalignments arising from paralogy may affect the development of single nucleotide polymorphism (SNP) markers and the estimation of genetic diversity. We evaluated the impact of putatively paralogous loci on genetic diversity estimation during the development of SNPs from a RADseq dataset for the nonmodel tree species Robinia pseudoacacia L. We sequenced nine genotypes and analyzed the frequency of putative paralogous RAD loci as a function of both the depth of coverage and the mismatch threshold allowed between loci. The proportion of loci showing putative paralogy varied widely, from 1% to more than 20%, with the depth of coverage having a major influence on the result. Putative paralogy artificially increased the observed degree of polymorphism and the resulting estimates of diversity. The choice of the depth-of-coverage threshold also affected diversity estimation and SNP validation: a low threshold decreased the chances of detecting minor alleles, while a high threshold increased allelic dropout. SNP validation was better for the low threshold (4×) than for the high threshold (18×) we tested. Using the strategy developed here, we were able to validate more than 80% of the SNPs tested by means of individual genotyping, resulting in a readily usable set of 330 SNPs suitable for population genetics applications.
Highlights
With the extensive development of next-generation sequencing (NGS) technologies and the accurate bioinformatics treatment of data, it is feasible to obtain genomic data and develop single nucleotide polymorphism (SNP) markers for nonmodel species (Etter et al., 2011).
Consistent with the results reported above, putative paralogy directly influenced the level of polymorphism measured at the sequence level: RAD loci identified as paralogous were more polymorphic than nonparalogous loci (Table 1).
RADseq technology is increasingly used in population genetics studies because it provides a rapid and cheap means for developing thousands of polymorphic SNP loci, almost regardless of genome size and previous genomic knowledge (Mastretta-Yanes et al., 2015).
Summary
With the extensive development of next-generation sequencing (NGS) technologies and the accurate bioinformatics treatment of data, it is feasible to obtain genomic data and develop single nucleotide polymorphism (SNP) markers for nonmodel species (Etter et al., 2011). Existing methods for detecting paralogy in NGS data are based on the elimination of RAD loci that contain too many SNPs or deviate from Hardy–Weinberg equilibrium (Lexer et al., 2014), the elimination of RAD loci with excessively high coverage (Bianco et al., 2014), or tests for the existence of two loci at each given position, as implemented in the paralogy filtering option of the reads2snp program (Gayral et al., 2013). These methods help to increase the efficiency of de novo assemblies of short reads and the detection of sequencing misalignments, resulting in more accurate SNP detection. We added a validation step through genotyping to estimate the efficacy of the data cleaning with this approach.
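The three kinds of filter cited above (too many SNPs, deviation from Hardy–Weinberg equilibrium, and excess coverage) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study, nor the reads2snp implementation: the `RadLocus` record, the function names, and every cutoff (`max_snps=5`, a depth cutoff of twice the median, a chi-square cutoff of 3.84) are hypothetical choices made for illustration only. The Hardy–Weinberg check uses a plain one-degree-of-freedom chi-square statistic, which is only a rough approximation for a sample as small as nine genotypes.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RadLocus:
    """Hypothetical per-locus summary; field names are illustrative."""
    name: str
    n_snps: int              # number of SNPs called in the locus
    mean_depth: float        # mean read depth across individuals
    genotype_counts: tuple   # (n_AA, n_Aa, n_aa) for one SNP in the locus


def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        return 0.0
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def paralogy_flags(locus, depth_cutoff, max_snps=5, chi2_cutoff=3.84):
    """Return the reasons (possibly none) for treating a locus as a putative paralog.

    All cutoffs are assumptions; 3.84 is the 5% critical value of a
    chi-square distribution with one degree of freedom.
    """
    flags = []
    if locus.n_snps > max_snps:
        flags.append("too_many_snps")
    if locus.mean_depth > depth_cutoff:
        flags.append("excess_coverage")
    if hwe_chi2(*locus.genotype_counts) > chi2_cutoff:
        flags.append("hwe_deviation")
    return flags


def flag_loci(loci, depth_factor=2.0, **kwargs):
    """Flag loci, taking the depth cutoff as a multiple of the median depth."""
    depth_cutoff = depth_factor * median(l.mean_depth for l in loci)
    return {l.name: paralogy_flags(l, depth_cutoff, **kwargs) for l in loci}


# Toy data for nine individuals; the values are invented for illustration.
loci = [
    RadLocus("locus_1", n_snps=2, mean_depth=12.0, genotype_counts=(4, 4, 1)),
    RadLocus("locus_2", n_snps=9, mean_depth=40.0, genotype_counts=(0, 9, 0)),
    RadLocus("locus_3", n_snps=1, mean_depth=10.0, genotype_counts=(9, 0, 0)),
]
flags = flag_loci(loci)
for name, reasons in flags.items():
    print(name, reasons or "kept")
```

In this toy example only `locus_2` is flagged: it carries many SNPs, has well above median coverage, and every individual is heterozygous, a pattern typically expected when reads from two paralogous copies are merged into one locus. In practice the cutoffs would need tuning to the dataset, since the abstract reports that the depth threshold strongly influences how many loci appear paralogous.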