Similarity thresholds used in DNA sequence assembly from short reads can reduce the comparability of population histories across species.

Michael G Harvey, Gary R Graves, Robb T Brumfield, Glenn F Seeholzer, Caroline Duffie Judy, James M Maley

doi:10.7717/peerj.895

Open Access

https://doi.org/10.7717/peerj.895

Abstract

Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted impacts of divergence, gene flow and selection across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de novo assembly of short reads into clusters of alleles representing different loci, but the resulting datasets are sensitive to both the similarity threshold used and to the variation naturally present in the organism under study. Thresholds that require high sequence similarity among reads for assembly (stringent thresholds) as well as highly variable species may result in datasets in which divergent alleles are lost or divided into separate loci (‘over-splitting’), whereas liberal thresholds increase the risk of paralogous loci being combined into a single locus (‘under-splitting’). Comparisons among datasets or species are therefore potentially biased if different similarity thresholds are applied or if the species differ in levels of within-lineage genetic variation. We examine the impact of a range of similarity thresholds on assembly of empirical short read datasets from populations of four different non-model bird lineages (species or species pairs) with different levels of genetic divergence. We find that, in all species, stringent similarity thresholds result in fewer alleles per locus than more liberal thresholds, which appears to be the result of high levels of over-splitting. The frequency of putative under-splitting, conversely, is low at all thresholds. Inferred genetic distances between individuals, gene tree depths, and estimates of the ancestral mutation-scaled effective population size (θ) differ depending upon the similarity threshold applied. Relative differences in inferences across species differ even when the same threshold is applied, but may be dramatically different when datasets assembled under different thresholds are compared. 
These differences not only complicate comparisons across species, but also preclude the application of standard mutation rates for parameter calibration. We suggest some best practices for assembling short read data to maximize comparability, such as using more liberal thresholds and examining the impact of different thresholds on each dataset.
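The over-splitting and under-splitting described above can be illustrated with a toy example. The sketch below is not the authors' pipeline or any published assembler; it is a minimal greedy clustering written for illustration, and the function names (`identity`, `cluster_reads`), the 20-base sequences, and the three thresholds are all assumptions. It groups equal-length reads by their identity to the first read of each cluster.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads, threshold):
    """Greedy clustering (hypothetical helper, for illustration only).

    Each read joins the first cluster whose seed (first member) it matches
    at >= threshold identity; otherwise it seeds a new cluster.
    """
    clusters = []
    for read in reads:
        for cluster in clusters:
            if identity(read, cluster[0]) >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


# Invented toy data: two alleles of one locus plus a read from a paralog.
allele_a1 = "ACGTACGTACGTACGTACGT"
allele_a2 = "ACGAACGTACGTACGAACGT"  # 2 of 20 sites differ from allele_a1 (90% identity)
paralog = "AGGTAGGTAGGTAGGTAGGT"    # 5 of 20 sites differ from allele_a1 (75% identity)
reads = [allele_a1, allele_a2, paralog]

for threshold in (0.95, 0.85, 0.70):
    n_loci = len(cluster_reads(reads, threshold))
    print(f"threshold={threshold:.2f}: {n_loci} inferred loci")
```

With these invented sequences, the stringent threshold (0.95) places the two alleles in separate clusters (over-splitting, 3 inferred loci), the intermediate threshold (0.85) recovers the locus and the paralog separately (2 inferred loci), and the liberal threshold (0.70) merges the paralog with the true locus (under-splitting, 1 inferred locus). A more variable species would need a more liberal threshold to avoid over-splitting, which is why a single fixed threshold can affect datasets differently.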

Highlights

With the proliferation of population-level datasets obtained using massively parallel sequencing technologies, there is increasing interest in studies that compare inferences from genomic datasets obtained from different species (e.g., Leaché et al., 2013; Smith et al., 2013) or from different genomic regions (e.g., Evans et al., 2014; Harvey et al., 2013; Leaché et al., 2015).
Inferences differ among lineages with different population histories and according to the similarity threshold applied during dataset assembly.
Differences in the impact of similarity thresholds across datasets reduce the utility of those datasets for comparative studies, and preclude the application of standardized mutation rate estimates that would allow demographic parameters in non-model species to be converted to absolute values (DaCosta & Sorenson, 2014).
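The dependence of absolute estimates on the assembly threshold can be made explicit with a short worked relation. This is a sketch under stated assumptions, not a formula from the paper: it assumes a diploid population with per-generation mutation rate \mu, and the bias factor b_t (the ratio of the estimate obtained at threshold t to the true value) is a hypothetical symbol introduced here for illustration.

```latex
\begin{align}
\theta &= 4 N_e \mu
  &&\Longrightarrow\quad N_e = \frac{\theta}{4\mu} \\
\hat{\theta}_t &= b_t\,\theta
  &&\Longrightarrow\quad \hat{N}_{e,t} = \frac{\hat{\theta}_t}{4\mu} = b_t\,N_e \\
\frac{\hat{N}^{A}_{e,t}}{\hat{N}^{B}_{e,t}}
  &= \frac{b^{A}_t}{b^{B}_t}\cdot\frac{N^{A}_e}{N^{B}_e}
  &&\text{(species $A$ and $B$, same $\mu$ assumed)}
\end{align}
```

Under these assumptions, any threshold-induced bias in θ carries through unchanged to the absolute effective population size, and comparisons between species are distorted whenever the bias factors differ between them, even if a well-estimated mutation rate is available.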

Summary

Introduction

With the proliferation of population-level datasets obtained using massively parallel sequencing technologies, there is increasing interest in studies that compare inferences from genomic datasets obtained from different species (e.g., Leaché et al., 2013; Smith et al., 2013) or from different genomic regions (e.g., Evans et al., 2014; Harvey et al., 2013; Leaché et al., 2015). Assembly of short sequence reads into orthologous loci is a key component of post-sequencing processing, and commonly used methods can lead to biases in population genetic parameter estimation (Ilut, Nydam & Hare, 2014). Many of these methods group reads using a sequence similarity threshold. Selecting the most appropriate threshold is challenging, primarily because the amount of genetic (allelic) variation can vary greatly among orthologous loci within a species (Ilut, Nydam & Hare, 2014). Because the amount of genetic variation also varies among species and genomic regions, a particular similarity threshold may impact each dataset differently, potentially influencing inferences in comparative studies.

Journal: PeerJ	Publication Date: Apr 21, 2015
Citations: 76	License type: cc-by

Similar Papers

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads
Seth Commichaux ... Narjol Gonzalez-Escalona
BMC Genomics | VOL. 22
26 May 2021

Microindel detection in short-read sequence data
Peter Krawitz ... Marten Jäger
Bioinformatics | VOL. 26
09 Feb 2010

Estimation of the ancestral effective population sizes of African great apes under different selection regimes.
Carlos G Schrago
Genetica | VOL. 142
13 Jun 2014

De novo assembly of short sequence reads
K Paszkiewicz ... D J Studholme
Briefings in Bioinformatics | VOL. 11
19 Aug 2010
