Abstract
The pan-genome of a species is defined as the union of all the genes and non-coding sequences found in all its individuals. However, constructing a pan-genome for plants with large genomes is daunting both in sequencing cost and the scale of the required computational analysis. A more affordable alternative is to focus on the genic repertoire by using transcriptomic data. Here, the software GET_HOMOLOGUES-EST was benchmarked with genomic and RNA-seq data of 19 Arabidopsis thaliana ecotypes and then applied to the analysis of transcripts from 16 Hordeum vulgare genotypes. The goal was to sample their pan-genomes and classify sequences as core, if detected in all accessions, or accessory, when absent in some of them. The resulting sequence clusters were used to simulate pan-genome growth, and to compile Average Nucleotide Identity matrices that summarize intra-species variation. Although transcripts were found to under-estimate pan-genome size by at least 10%, we concluded that clusters of expressed sequences can recapitulate phylogeny and reproduce two properties observed in A. thaliana gene models: accessory loci show lower expression and higher non-synonymous substitution rates than core genes. Finally, accessory sequences were observed to preferentially encode transposon components in both species, plus disease resistance genes in cultivated barleys, and a variety of protein domains from other families that appear frequently associated with presence/absence variation in the literature. These results demonstrate that pan-genome analyses are useful to explore germplasm diversity.
Highlights
High-throughput sequencing has made it possible to assemble whole genomes and transcriptomes at an unprecedented rate, leading to the comparison of individuals of the same species.
While the bidirectional best hit (BDBH) algorithm seeds clusters with sequences from a selected reference genotype, and therefore skips genes absent from it (Contreras-Moreira and Vinuesa, 2013), the OrthoMCL (OMCL) algorithm groups nodes in a graph to build clusters which can have any composition, even without sequences from the reference (Li et al, 2003).
Among the parameters used to control these steps, alignment coverage is perhaps the most important. By default it is calculated as depicted in Figure 1B: with respect to the shortest sequence, after adding up all non-overlapping segments reported by BLASTN.
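The coverage calculation described above can be illustrated with a minimal sketch. This is not the GET_HOMOLOGUES-EST implementation: the function names, the tuple layout of the aligned segments, and the choice to merge overlapping segments into their union before summing are all assumptions made for illustration. Coordinates are taken to be 1-based and inclusive, as in BLASTN tabular output.

```python
from typing import Iterable, List, Tuple

# Hypothetical layout of one aligned segment:
# (query_start, query_end, subject_start, subject_end), 1-based inclusive.
Segment = Tuple[int, int, int, int]


def merged_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Number of positions covered by the union of the given intervals."""
    # Normalise orientation first: minus-strand hits report start > end.
    ordered = sorted((min(a, b), max(a, b)) for a, b in intervals)
    total = 0
    cur_start = cur_end = None
    for start, end in ordered:
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: close it and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def alignment_coverage(query_len: int, subject_len: int,
                       segments: List[Segment]) -> float:
    """Percent coverage measured on the shorter of the two sequences."""
    if query_len <= subject_len:
        intervals = [(qs, qe) for qs, qe, _, _ in segments]
        shortest = query_len
    else:
        intervals = [(ss, se) for _, _, ss, se in segments]
        shortest = subject_len
    return 100.0 * merged_length(intervals) / shortest


# A 1000-nt query against a 400-nt subject with three aligned segments,
# two of which overlap on the subject.
example = [(1, 100, 1, 100), (201, 300, 151, 250), (290, 350, 240, 300)]
coverage = alignment_coverage(1000, 400, example)
```

In this example the subject is the shorter sequence, so coverage is measured on its coordinates: the merged segments span 250 of its 400 positions. Measuring against the shorter sequence means a fragmentary transcript can still cluster with a full-length one, which matters when the inputs are assembled transcripts of uneven length.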
Summary
High-throughput sequencing has made it possible to assemble whole genomes and transcriptomes at an unprecedented rate, leading to the comparison of individuals of the same species. Some studies have compared ecotypes of the model species Arabidopsis thaliana and accessions of crops such as maize, barley, soybean, or rice, revealing that dispensable genes play important roles in evolution and in the complex interplay between plants and the environment (Cao et al, 2011; Hansey et al, 2012; Dai et al, 2014; Hirsch et al, 2014; Li et al, 2014; Yao et al, 2015). Exploring presence/absence variation (PAV) of accessory loci increases our capacity to link genotypes to phenotypes, and PAV has been found to explain phenotypic differences among cultivars beyond those revealed by standard SNP-based genotyping methods (Yano et al, 2016).