How challenging RADseq data turned out to favor coalescent-based species tree inference. A case study in Aichryson (Crassulaceae)

Philipp Hühn, Markus S Dillenberger, Michael Gerschwitz-Eidt, Elvira Hörandl, Jessica A Los, Thibaud F.E Messerschmid, Claudia Paetzold, Benjamin Rieger, Gudrun Kadereit

doi:10.1016/j.ympev.2021.107342

Abstract

Analysing multiple genomic regions while incorporating detection and qualification of discordance among regions has become standard for understanding phylogenetic relationships. In plants, which usually have comparatively large genomes, this is feasible by combining reduced-representation library (RRL) methods with high-throughput sequencing, which enables the cost-effective acquisition of genomic data for thousands of loci from hundreds of samples. One popular RRL method is RADseq. A major disadvantage of established RADseq approaches is the rather short fragment and sequencing range, which leads to loci that individually carry little phylogenetic information. This issue hampers the application of coalescent-based species tree inference. The modified RADseq protocol presented here, which targets ca. 5,000 loci of 300-600 nt length sequenced with the latest short-read-sequencing (SRS) technology, has the potential to overcome this drawback. To illustrate the advantages of this approach we use the study group Aichryson Webb & Berthelot (Crassulaceae), a plant genus that diversified on the Canary Islands. The data analysis approach used here aims at careful quality control of the long-loci dataset. It involves an informed selection of thresholds for accurate clustering; a thorough exploration of locus properties, such as locus length, coverage and variability, to identify potentially biased data; and a comparative phylogenetic inference of filtered datasets, accompanied by an evaluation of the resulting BS support and gene and site concordance factor values, to improve the overall resolution of the resulting phylogenetic trees. The final dataset contains variable loci with an average length of 373 nt and facilitates species tree estimation using a coalescent-based summary approach. Additional improvements brought by the approach are critically discussed.

Highlights

For the BSC clustering threshold (CT) selection, we evaluated the number of retained loci, sequence variation (VAR, single nucleotide polymorphism (SNP) and parsimony informative sites (PIS)) and proportion of missingness
Demultiplexed raw data is available at the NCBI Sequence Read Archive in BioProject PRJNA642981
We found overall matching trends of locus properties relative to the resulting phylogenetic patterns of maximum likelihood analysis of concatenated loci (CA-maximum likelihood (ML)) and coalescent-based summary method (CB-SM) used for bias detection
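
The locus properties named in the highlights (VAR, PIS and the proportion of missingness) can be illustrated with a minimal sketch for a single aligned locus. This is not the authors' pipeline: the function name `locus_stats`, the input layout (a dict of sample name to aligned sequence) and the rule that every character other than A, C, G or T counts as missing are assumptions made for illustration. The `var` value follows the definition given in the paper's abbreviation list (VAR / locus length / number of samples).

```python
from collections import Counter

# Assumption for this sketch: only unambiguous bases count as called data;
# gaps, N and IUPAC ambiguity codes are all treated as missing.
CALLED = set("ACGT")


def locus_stats(alignment):
    """Summarise one aligned locus given as {sample_name: aligned_sequence}."""
    seqs = [seq.upper() for seq in alignment.values()]
    if not seqs:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    if length == 0 or any(len(seq) != length for seq in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    n_samples = len(seqs)
    var_sites = 0   # VAR: columns with more than one called base
    pis_sites = 0   # PIS: columns with >= 2 bases each seen in >= 2 samples
    missing = 0     # cells that are not A, C, G or T

    for column in zip(*seqs):
        counts = Counter(base for base in column if base in CALLED)
        missing += n_samples - sum(counts.values())
        if len(counts) > 1:
            var_sites += 1
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                pis_sites += 1

    return {
        "length": length,
        "VAR": var_sites,
        "PIS": pis_sites,
        "missing": missing / (length * n_samples),
        # var as defined in the abbreviation list: VAR / locus length / samples
        "var": var_sites / length / n_samples,
    }


if __name__ == "__main__":
    # Hypothetical toy locus, not data from the study.
    toy_locus = {
        "s1": "ACGTAC",
        "s2": "ACGTAA",
        "s3": "ATGTNA",
        "s4": "GTGT-C",
    }
    print(locus_stats(toy_locus))
```

Applied to every locus of an assembly, such per-locus values could be tabulated across clustering thresholds or filtering steps to look for the kind of trends the highlights describe.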

Summary

Introduction

Resolving phylogenetic relationships of recently and rapidly radiating species complexes is a challenge because first, standard markers using universal primers are too conserved and fail to provide sufficient information, and second, inferring relationships is often complicated due to incomplete lineage sorting (ILS), hybridization/introgression and gene duplication/loss events (Pamilo and Nei, 1988; Maddison, 1997; Maddison and Knowles, 2006; Kubatko and Degnan, 2007; Whitfield and Lockhart, 2007; Degnan et al, 2006; Degnan and Rosenberg, 2009; Heled and Drummond, 2009; Yang and Rannala, 2010; Rannala et al, 2020).
Full-coalescence approaches under the MSC are computationally very intensive when applied to large-scale genomic data and often not feasible (McCormack et al, 2013a; Smith et al, 2014; Zimmermann et al, 2014). Other approaches have become increasingly popular and widely used: maximum likelihood analysis of concatenated multi-locus data (de Queiroz et al, 1995; Yang 1996; de Queiroz and Gatesy 2007), coalescent-based summary methods that estimate species trees from independently inferred gene trees (here called “locus trees”) (Mirarab et al, 2014a; Mirarab and Warnow, 2015; Rannala et al, 2020), and coalescent-based methods that use site patterns of assembled loci for species tree inference (Bryant et al, 2012; Chifman and Kubatko, 2014; Bryant and Hahn, 2020). These methods each have advantages and disadvantages, and their correct application to modern high-throughput data, in particular to approaches that generate short loci with high amounts of missing data such as RADseq, is highly controversial.

Abbreviations: BSC, between-sample-clustering; CA-ML, maximum likelihood analysis of concatenated loci; CB-SM, coalescent-based summary method; CT, clustering threshold; gCF, gene concordance factor; GTEE, gene tree estimation error; HTS, high throughput sequencing; ILS, incomplete lineage sorting; ISC, in-sample-clustering; ML, maximum likelihood; MSC, multi-species coalescent (model); NPL, new polymorphic loci; PE, paired-end; PIC, parsimony informative character; PIS, parsimony informative site; RADseq, restriction site-associated DNA sequencing; REase, restriction endonuclease; RRL, reduced-representation library (methods); sCF, site concordance factor; SNP, single nucleotide polymorphism; SRS, short-read sequencing; SVD, SVDquartets; VAR, variable sites (sequence variation); var, variability (VAR/locus length/number of samples).

Journal: Molecular Phylogenetics and Evolution	Publication Date: Nov 14, 2021
Citations: 17	License type: cc-by
