Abstract

Background

With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of crop germplasm to detect genome-scale genetic variations and to apply that knowledge towards trait improvement. To facilitate large-scale analysis of genomic variations in NGS resequencing data, we have developed “PGen”, an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and the Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way.

Results

We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB) (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 SNPs and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. This analysis also identified 297,245 non-synonymous SNPs and 3,330 copy number variation (CNV) regions. SNPs identified using PGen from additional soybean resequencing projects, adding up to 500+ soybean germplasm lines in total, have also been integrated. These SNPs are being utilized for trait improvement using genotype-to-phenotype prediction approaches developed in-house. To make NGS data easy to browse and access, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB that gives soybean researchers access to SNP and downstream analysis results.

Conclusion

The PGen workflow has been optimized for efficient analysis of soybean data through thorough testing and validation. This research serves as an example of best practices for developing genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. The PGen workflow can also be easily customized for analysis of data in other species.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-1227-y) contains supplementary material, which is available to authorized users.

Highlights

  • In-depth informatics analysis of genotypic data can provide a better understanding of genotype-phenotype correlations, with applications that support trait improvement

  • Single nucleotide polymorphism (SNP) and indel identification must be followed by more complex downstream analyses, including SNP annotation, copy number variation (CNV) analysis, genome-wide association studies (GWAS), haplotype analysis and others

  • There is a significant need for fast, efficient and easy-to-use computational pipelines for biological researchers that use advanced techniques such as high-performance computing (HPC) and cloud storage resources, and that provide access to remote computing resources with the scalability to meet the demands of such research projects
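
To make the SNP/indel distinction in the highlights concrete, the following is a minimal Python sketch that tallies SNPs and indels from VCF-style records by comparing REF and ALT allele lengths. It is an illustration only and not the PGen implementation: the function names `classify_variant` and `count_variants`, and the example records (including the chromosome name `Chr01`), are hypothetical and were chosen for this sketch.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT allele pair as 'SNP', 'indel' or 'other'."""
    # Symbolic, spanning-deletion, or missing alleles are not simple sequence changes.
    if alt.startswith("<") or alt in ("*", "."):
        return "other"
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"


def count_variants(vcf_lines):
    """Count variant types over tab-separated VCF-style lines.

    Header lines (starting with '#') and blank lines are skipped.
    Each ALT allele of a multi-allelic record is counted separately.
    """
    counts = {"SNP": 0, "indel": 0, "other": 0}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # VCF columns 4 (REF) and 5 (ALT)
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts


# Hypothetical example records, invented for this sketch (not real soybean data).
EXAMPLE_LINES = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t100\t.\tA\tG\t50\tPASS\t.",
    "Chr01\t200\t.\tAT\tA\t50\tPASS\t.",
    "Chr01\t300\t.\tC\tCGG,T\t50\tPASS\t.",
]

if __name__ == "__main__":
    print(count_variants(EXAMPLE_LINES))
```

Counting each ALT allele separately is one possible convention for multi-allelic sites; a production pipeline may instead normalize or split such records before counting.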


Summary

Results

Genomic variations for soybean germplasm lines

We have generated genomic variations and downstream results for all 500+ resequencing lines from multiple projects. The identified variants were used in GWAS analyses with our in-house method BHIT [21] and in soybean transporter database (SoyTD) analysis. We also analyzed a dataset of 28 Brazilian soybean cultivars sequenced at 15X coverage [22]. The NGS data browser provides six different tabs for browsing the various types of analysis results generated by PGen.

