Bioinformatic analysis of genotype by sequencing (GBS) data with NGSEP

Claudia Perea, Juan David Lobaton, Bodo Raatz, Juan Camilo Quintero, Jorge Duitama, Daniel Felipe Cruz, Juan Fernando De La Hoz, Paulo Izquierdo

doi:10.1186/s12864-016-2827-7


Open Access

https://doi.org/10.1186/s12864-016-2827-7


Abstract

Background: The recent development and availability of different genotype by sequencing (GBS) protocols has provided a cost-effective approach to perform high-resolution genomic analysis of entire populations in different species. The central component of all these protocols is the digestion of the initial DNA with known restriction enzymes, to generate sequencing fragments at predictable and reproducible sites. This makes it possible to genotype thousands of genetic markers on populations with hundreds of individuals. Because GBS protocols achieve parallel genotyping through high throughput sequencing (HTS), every GBS protocol must include a bioinformatics pipeline for analysis of HTS data. Our bioinformatics group recently developed the Next Generation Sequencing Eclipse Plugin (NGSEP) for accurate, efficient, and user-friendly analysis of HTS data.

Results: Here we present the latest functionalities implemented in NGSEP in the context of the analysis of GBS data. We implemented a one-step wizard to perform parallel read alignment, variant identification and genotyping from HTS reads sequenced from entire populations. We added different filters for variants, samples and genotype calls, as well as calculation of summary statistics overall and per sample, and diversity statistics per site. NGSEP includes a module to translate genotype calls to some of the most widely used input formats for integration with several tools to perform downstream analyses such as population structure analysis, construction of genetic maps, genetic mapping of complex traits and phenotype prediction for genomic selection. We assessed the accuracy of NGSEP on two highly heterozygous F1 cassava populations and on an inbred common bean population, and we showed that NGSEP provides similar or better accuracy compared to other widely used software packages for variant detection such as GATK, Samtools and Tassel.

Conclusions: NGSEP is a powerful, accurate and efficient bioinformatics software tool for analysis of HTS data, and also one of the best bioinformatic packages to facilitate the analysis and to maximize the genomic variability information that can be obtained from GBS experiments for population genomics.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2827-7) contains supplementary material, which is available to authorized users.

Highlights

The recent development and availability of different genotype by sequencing (GBS) protocols provided a cost-effective approach to perform high-resolution genomic analysis of entire populations in different species
Deconvolution can be generally applied to any type of sequencing in which samples are identified by barcodes; it is important for GBS data because in GBS experiments 96 samples are sequenced per lane to achieve cost efficiency
This allows removal of adaptor sequence contamination at the 3' end of the reads, which we have identified as the most important issue affecting the quality of the reads produced by the Elshire protocol for GBS
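
To make the two read-processing ideas in the highlights concrete, the following is a minimal Python sketch of barcode-based deconvolution followed by removal of adaptor sequence from the 3' end of each read. It is an illustration only and not NGSEP's implementation: the barcode table, sample names, adaptor sequence, the `min_overlap` parameter and both function names are hypothetical, and matching is exact, whereas practical tools usually tolerate sequencing errors.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table (assumption, for illustration only).
BARCODES = {"ACGT": "sample_A", "TTGCA": "sample_B"}
# Hypothetical adaptor sequence (assumption, not taken from any protocol).
ADAPTER = "AGATCGGAAG"


def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact match of its 5' barcode prefix."""
    by_sample = defaultdict(list)
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            if read.startswith(barcode):
                # Store the read with its barcode removed.
                by_sample[barcodes[barcode]].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)


def trim_adapter_3prime(read, adapter, min_overlap=3):
    """Cut the read at a full adaptor match, or at a partial match on its 3' end."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Otherwise look for a prefix of the adaptor that overlaps the read end,
    # trying the longest possible overlap first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


if __name__ == "__main__":
    reads = [
        "ACGT" + "CAGCTTTGA" + ADAPTER + "CC",  # full adaptor read-through
        "TTGCA" + "CAGCGGG" + ADAPTER[:4],      # partial adaptor at the 3' end
        "GGGGCAGC",                             # no known barcode
    ]
    for sample, sample_reads in demultiplex(reads, BARCODES).items():
        print(sample, [trim_adapter_3prime(r, ADAPTER) for r in sample_reads])
```

Trying longer barcodes first matters when barcodes have variable lengths, because a short barcode could otherwise match the beginning of a longer one and assign the read to the wrong sample.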

Summary

Introduction

The recent development and availability of different genotype by sequencing (GBS) protocols has provided a cost-effective approach to perform high-resolution genomic analysis of entire populations in different species. The central component of all these protocols is the digestion of the initial DNA with known restriction enzymes, to generate sequencing fragments at predictable and reproducible sites. This makes it possible to genotype thousands of genetic markers on populations with hundreds of individuals. Sequenced reads can be demultiplexed and either analyzed de novo or aligned to a reference genome, if one is available. In the latter case, variants can be identified using analysis pipelines similar to those used for analysis of whole genome resequencing data [8]. GBS is becoming the method of choice for several applications in plant genomics and plant breeding [9], such as the analysis of population dynamics [2], construction of high density genetic maps [4, 8], genetic mapping of complex traits through Genome-Wide Association Studies (GWAS) [3] and estimation of breeding values in genomic selection [1, 5].
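
The claim that restriction digestion yields fragments at predictable and reproducible sites can be illustrated with a small in silico digestion in Python. This is a sketch under stated assumptions, not part of NGSEP: the default recognition pattern `GC[AT]GC` with a cut after the first base is assumed here to represent the ApeKI site (G^CWGC), an enzyme the text above does not name, and the function name `digest`, its parameters and the toy sequence are hypothetical.

```python
import re


def digest(sequence, site_regex="GC[AT]GC", cut_offset=1):
    """Split `sequence` at every occurrence of the recognition site.

    `site_regex` and `cut_offset` are illustrative assumptions; the defaults
    model a cut after the first base of GCWGC (W = A or T).
    """
    # A lookahead finds overlapping sites as well as disjoint ones.
    pattern = re.compile(f"(?=({site_regex}))")
    cuts = [m.start() + cut_offset for m in pattern.finditer(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Toy sequence containing two recognition sites (hypothetical data).
    genome = "AAAGCAGCTTTTGCTGCAA"
    for fragment in digest(genome):
        print(fragment)
```

Because the cut positions depend only on the sequence and the recognition pattern, digesting the same genome always produces the same fragments, which is why the same loci tend to be sequenced across all individuals of a population.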

Methods

Results

Conclusion