PGen: large-scale genomic variations analysis workflow and browser in SoyKB.

Yang Liu, Shuai Zeng, Trupti Joshi, Prasad P Calyam, Juexin Wang, Dong Xu, Joao V Maldonado Dos Santos, Mats Rynge, Shiyuan Chen, Nirav Merchant, Yuanxun Zhang, Saad M Khan, Henry T Nguyen, Babu Valliyodan

doi:10.1186/s12859-016-1227-y

Abstract

Background

With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed “PGen”, an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and the Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way.

Results

We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB) (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 SNPs and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. The analysis also identified 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions. SNPs identified using PGen from additional soybean resequencing projects, bringing the total to 500+ soybean germplasm lines, have also been integrated. These SNPs are being utilized for trait improvement using genotype-to-phenotype prediction approaches developed in-house. To make NGS data easy to browse and access, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB, which gives soybean researchers easy access to SNP and downstream analysis results.

Conclusion

The PGen workflow has been optimized for efficient analysis of soybean data through thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. The PGen workflow can also be easily customized for analysis of data in other species.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-1227-y) contains supplementary material, which is available to authorized users.
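The SNP and indel totals reported above are tallies over variant calls. As a rough illustration only, and not PGen's actual implementation, the sketch below counts SNPs and indels in VCF-formatted text by comparing the lengths of the REF and ALT alleles. The function names, the `EXAMPLE_VCF` records and the chromosome labels are all hypothetical and made up for this example; it assumes tab-separated VCF data lines with REF in the fourth column and ALT in the fifth.

```python
import io


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT allele pair as 'snp', 'indel', or 'other'."""
    # A SNP replaces exactly one base with one other base.
    if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
        return "snp"
    # An indel changes the allele length; symbolic alleles such as <DEL>
    # contain non-base characters and fall through to 'other'.
    if len(ref) != len(alt) and set(ref + alt) <= set("ACGTN"):
        return "indel"
    return "other"


def count_variants(vcf_lines):
    """Tally variant types over an iterable of VCF text lines."""
    counts = {"snp": 0, "indel": 0, "other": 0}
    for line in vcf_lines:
        # Skip meta-information lines, the column header and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        # A site may list several comma-separated ALT alleles.
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts


# Hypothetical records for illustration; not data from the paper.
EXAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
Chr01\t1001\t.\tA\tG\t50\tPASS\t.
Chr01\t2050\t.\tAT\tA\t60\tPASS\t.
Chr02\t300\t.\tC\tCTT,T\t45\tPASS\t.
"""

example_counts = count_variants(io.StringIO(EXAMPLE_VCF))
print(example_counts)
```

Counting per ALT allele rather than per site is a design choice in this sketch; a production pipeline might instead count sites, or split multi-allelic records before tallying.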

Highlights

In-depth informatics analysis of genotypic data can provide a better understanding of genotype-phenotype correlations, with applications designed to assist in the improvement of traits
Single nucleotide polymorphism (SNP) and indel identification procedures need to be followed by other complex downstream analyses, including SNP annotation, copy number variation (CNV) analysis, genome-wide association studies (GWAS), haplotype analysis and others
Biological researchers have a significant need for fast, efficient and easy-to-use computational pipelines that use advanced techniques such as high-performance computing (HPC) and cloud storage resources, and that provide access to remote computing resources with the scalability to meet the demands of such research projects

Summary

Results

Genomic variations for soybean germplasm lines

We have generated genomic variations and downstream results for all 500+ resequencing lines from multiple projects. The identified variants were used in GWAS analyses with our in-house method BHIT [21] and in soybean transporter database (SoyTD) analysis. We also analyzed another dataset, which contained 28 Brazilian soybean cultivars sequenced at a coverage of 15X [22]. The NGS data browser provides six different tabs for browsing various types of analysis results generated by PGen as described below:

Conclusion

Introduction

Methods

Upload data

Create Workflow

Monitor Workflow

Workflow Results

Summary

FastQC

SnpEff Annotation

Discussion

Journal: BMC Bioinformatics	Publication Date: Oct 1, 2016
Citations: 28	License type: cc-by
