Abstract

Analysing multiple genomic regions while incorporating detection and qualification of discordance among regions has become standard for understanding phylogenetic relationships. In plants, which usually have comparatively large genomes, this is made feasible by combining reduced-representation library (RRL) methods with high-throughput sequencing, enabling the cost-effective acquisition of genomic data for thousands of loci from hundreds of samples. One popular RRL method is RADseq. A major disadvantage of established RADseq approaches is the rather short fragment and sequencing range, which leads to loci that individually carry little phylogenetic information. This issue hampers the application of coalescent-based species tree inference. The modified RADseq protocol presented here targets ca. 5,000 loci of 300-600 nt length, sequenced with the latest short-read-sequencing (SRS) technology, and has the potential to overcome this drawback. To illustrate the advantages of this approach we use the study group Aichryson Webb & Berthelot (Crassulaceae), a plant genus that diversified on the Canary Islands. The data analysis approach used here aims at careful quality control of the long-loci dataset. It involves an informed selection of thresholds for accurate clustering; a thorough exploration of locus properties, such as locus length, coverage and variability, to identify potentially biased data; and a comparative phylogenetic inference of filtered datasets, accompanied by an evaluation of the resulting bootstrap (BS) support and gene and site concordance factor values, to improve the overall resolution of the resulting phylogenetic trees. The final dataset contains variable loci with an average length of 373 nt and facilitates species tree estimation using a coalescent-based summary approach. Additional improvements brought by the approach are critically discussed.

Highlights

  • To select the between-sample-clustering (BSC) clustering threshold (CT), we evaluated the number of retained loci, the sequence variation (VAR; single nucleotide polymorphisms (SNPs) and parsimony informative sites (PIS)) and the proportion of missing data

  • Demultiplexed raw data is available at the NCBI Sequence Read Archive in BioProject PRJNA642981

  • We found overall matching trends between locus properties and the phylogenetic patterns resulting from the maximum likelihood analysis of concatenated loci (CA-ML) and the coalescent-based summary method (CB-SM) used for bias detection
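
The locus properties named in these highlights can be illustrated with a small sketch. It is not the authors' pipeline: the function name `locus_stats`, the input format (a dict of aligned sequences) and the set of characters treated as missing are assumptions made for illustration. The sketch counts variable sites (VAR) and parsimony informative sites (PIS) per locus, the proportion of missing cells, and the variability measure var = VAR / locus length / number of samples as given in the abbreviations list. As a simplification, IUPAC ambiguity codes are treated as ordinary states.

```python
from collections import Counter

# Characters counted as missing data (an assumption; adjust to the assembler's output).
MISSING = set("N-?")


def locus_stats(alignment):
    """Summarise one aligned locus given as {sample_name: aligned_sequence}."""
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        raise ValueError("alignment is empty")
    n_samples = len(seqs)
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    var_sites = 0      # columns with at least two different observed states
    pis = 0            # columns with at least two states each seen in >= 2 samples
    missing_cells = 0  # alignment cells holding a missing-data character

    for column in zip(*seqs):
        counts = Counter(base for base in column if base not in MISSING)
        missing_cells += n_samples - sum(counts.values())
        if len(counts) >= 2:
            var_sites += 1
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                pis += 1

    return {
        "length": length,
        "n_samples": n_samples,
        "VAR": var_sites,
        "PIS": pis,
        "missingness": missing_cells / (length * n_samples),
        "var": var_sites / length / n_samples,
    }


if __name__ == "__main__":
    # Toy locus with hypothetical sample names, for demonstration only.
    toy = {
        "s1": "GCGTAC",
        "s2": "ACGTTC",
        "s3": "ATGTTN",
        "s4": "ATG-AC",
    }
    print(locus_stats(toy))
```

Computing such a summary for every locus at each candidate clustering threshold would give the kind of table from which retained loci, variation and missingness can be compared.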


Introduction

Abbreviations: BSC, between-sample-clustering; CA-ML, maximum likelihood analysis of concatenated loci; CB-SM, coalescent-based summary method; CT, clustering threshold; gCF, gene concordance factor; GTEE, gene tree estimation error; HTS, high throughput sequencing; ILS, incomplete lineage sorting; ISC, in-sample-clustering; ML, maximum likelihood; MSC, multi-species coalescent (model); NPL, new polymorphic loci; PE, paired-end; PIC, parsimony informative character; PIS, parsimony informative site; RADseq, restriction site-associated DNA sequencing; REase, restriction endonuclease; RRL, reduced-representation library (methods); sCF, site concordance factor; SNP, single nucleotide polymorphism; SRS, short-read sequencing; SVD, SVDquartets; VAR, variable sites (sequence variation); var, variability (VAR/locus length/number of samples).

Resolving phylogenetic relationships of recently and rapidly radiating species complexes is a challenge because first, standard markers using universal primers are too conserved and fail to provide sufficient information, and second, inferring relationships is often complicated due to incomplete lineage sorting (ILS), hybridization/introgression and gene duplication/loss events (Pamilo and Nei, 1988; Maddison, 1997; Maddison and Knowles, 2006; Kubatko and Degnan, 2007; Whitfield and Lockhart, 2007; Degnan et al, 2006; Degnan and Rosenberg, 2009; Heled and Drummond, 2009; Yang and Rannala, 2010; Rannala et al, 2020).
Full-coalescence approaches under the MSC are computationally very intensive when applied to large-scale genomic data and are often not feasible (McCormack et al, 2013a; Smith et al, 2014; Zimmermann et al, 2014). Other approaches have become increasingly popular and widely used: maximum likelihood analysis of concatenated multi-locus data (de Queiroz et al, 1995; Yang 1996; de Queiroz and Gatesy 2007), coalescent-based summary methods that estimate species trees from independently inferred gene trees (here called “locus trees”) (Mirarab et al, 2014a; Mirarab and Warnow, 2015; Rannala et al, 2020), and coalescent-based methods that use site patterns of assembled loci for species tree inference (Bryant et al, 2012; Chifman and Kubatko, 2014; Bryant and Hahn, 2020). These methods each have advantages and disadvantages, and their correct application to modern high-throughput data, in particular to approaches that generate short loci with high amounts of missing data such as RADseq, is highly controversial.
