Marker Development for Phylogenomics: The Case of Orobanchaceae, a Plant Family with Contrasting Nutritional Modes.

Xi Li, Da Pan, Baohai Hao, Gerald M Schneeweiss

doi:10.3389/fpls.2017.01973

Abstract

Phylogenomic approaches, employing next-generation sequencing (NGS) techniques, have revolutionized systematic and evolutionary biology. Target enrichment is an efficient and cost-effective method in phylogenomics and is becoming increasingly popular. Depending on the availability and quality of reference data as well as on biological features of the study system, (semi-)automated identification of suitable markers will require specific bioinformatic pipelines. Here, we established a highly flexible bioinformatic pipeline, BaitsFinder, to identify putative orthologous single copy genes (SCGs) and to construct bait sequences in a single workflow. Additionally, this pipeline has been constructed to cope with challenging data sets, such as the nutritionally heterogeneous plant family Orobanchaceae. To this end, we used transcriptome data of differing quality available for four Orobanchaceae species and, as reference, SCG data from monkeyflower (Erythranthe guttata, syn. Mimulus g.; 1,915 genes) and tomato (Solanum lycopersicum; 391 genes). Depending on whether gaps were permitted in initial blast searches of the four Orobanchaceae species against the reference, our pipeline identified 1,307 and 981 SCGs with average lengths of 994 bp and 775 bp, respectively. Automated bait sequence construction (using 2× tiling) resulted in 38,170 and 21,856 bait sequences, respectively. In comparison to the recently published MarkerMiner 1.0 pipeline, BaitsFinder identified about 1.6 times as many SCGs (of at least 900 bp length). Skipping steps specific to analyses of Orobanchaceae, BaitsFinder was successfully used in a group of non-parasitic plants (three Asteraceae species and, as reference, SCG data from Arabidopsis thaliana based on previously compiled SCGs). Thus, BaitsFinder is expected to be broadly applicable in groups where only transcriptomes or partial genome data of differing quality are available.
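
As a rough illustration of what 2× tiling means for bait construction, the sketch below cuts a target sequence into overlapping baits offset by half a bait length, so that most bases are covered by two baits. The function name `tile_baits`, the 120-nt bait length, and the extra bait placed flush with the 3' end are assumptions made for illustration; they are not taken from BaitsFinder itself.

```python
def tile_baits(seq, bait_len=120, tiling=2):
    """Cut `seq` into overlapping baits (hypothetical helper, not BaitsFinder code).

    With tiling=2 and bait_len=120 (both assumed values), consecutive baits
    start 60 nt apart, so interior bases are covered by two baits.
    """
    if bait_len % tiling != 0:
        raise ValueError("bait_len must be divisible by tiling")
    step = bait_len // tiling

    # Targets shorter than one bait cannot be tiled in this simple sketch.
    if len(seq) < bait_len:
        return []

    starts = list(range(0, len(seq) - bait_len + 1, step))

    # Assumption: add one final bait flush with the 3' end so the tail
    # of the target is not left uncovered.
    last_start = len(seq) - bait_len
    if starts[-1] != last_start:
        starts.append(last_start)

    return [seq[s:s + bait_len] for s in starts]


if __name__ == "__main__":
    target = ("ACGT" * 250)[:994]  # toy 994-bp target, not real data
    baits = tile_baits(target)
    print(len(baits), "baits of length", len(baits[0]))
```

The number of baits this toy function yields for a given target length should not be expected to reproduce the bait counts reported above, which depend on the actual sequences and settings used by the pipeline.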

Highlights

Combining target enrichment with next-generation sequencing (NGS) strategies can yield a large number of low copy nuclear (LCN) loci and is becoming increasingly popular for systematic and evolutionary biology (Lemmon and Lemmon, 2013).
As expected, using gapped blast resulted in more recovered loci: 2,050 single copy genes (SCGs) in gapped blast versus 1,845 SCGs in ungapped blast and, after filtering for presence in Lindenbergia, 1,690 in gapped blast versus 1,555 in ungapped blast.
As data quality differs among our focal species, we only considered a single focal species, L. philippensis, which has the best quality data.

Summary

Introduction

Combining target enrichment with next-generation sequencing (NGS) strategies can yield a large number of low copy nuclear (LCN) loci and is becoming increasingly popular for systematic and evolutionary biology (Lemmon and Lemmon, 2013). Campana (2017) developed BaitsTools, which automates bait design from various sources (e.g., alignments, unaligned sequences, and RADseq loci) and quality checking of the obtained baits. Disadvantages of these approaches include reliance on whole or draft genome sequences (de Sousa et al., 2014; Weitemier et al., 2014) or settings that bias against less conserved loci, i.e., strictly reciprocal blast searches in MarkerMiner (Chamala et al., 2015) and the highly reduced bait-to-target distances in BaitFisher (Mayer et al., 2016). None of these methods assessed the effect of using different blast strategies (with and without gaps) on the number and length of recovered LCN loci.

Methods

Results

Conclusion