Benchmarking variant identification tools for plant diversity discovery

Xing Wu, Hongyu Zhao, Stephen L Dellaporta, Christopher Heffelfinger

doi:10.1186/s12864-019-6057-7

Abstract

Background

The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets.

Results

A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage, and genome complexity. A cross-reference experiment using the S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, yielding more true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation.

Conclusions

Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study serves as a guide for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.
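The precision and recall used to compare variant callers can be illustrated with a minimal sketch. It assumes variants are represented as (chromosome, position, reference allele, alternate allele) tuples and that a truth set is available, as with simulated data. The function name `precision_recall` and the example variants are hypothetical and are not taken from the study.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for called variants against a truth set.

    Both arguments are iterables of hashable variant keys, here assumed
    to be (chrom, pos, ref, alt) tuples.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)  # variants that are both called and real
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical truth set (e.g. from a simulation) and caller output.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 77, "A", "C"),  # false positive: absent from the truth set
}

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case two of the three calls are correct and two of the four true variants are recovered, so a caller can score well on one metric while lagging on the other.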

Highlights

The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing
The group with the higher alignment percentage contained wild species closely related to domesticated tomatoes, whereas the lower group contained distantly related wild species, based on previous domestication and diversity studies [3, 40]
We evaluated how variant discovery in these 82 tomato samples changed when reads were mapped to a wild reference genome (S. pennellii) [41]

Summary

Introduction

The ability to accurately and comprehensively identify genetic variations is critical for plant studies utilizing high-throughput sequencing, and it is one of the most important steps in genomic analyses. The variant discovery pipeline for a whole-genome sequencing (WGS) dataset can be roughly divided into four steps: read mapping, variant calling, variant filtering, and imputation. Sequence aligners for the read mapping step can be grouped according to their indexing methodologies [9]. Programs such as Novoalign (http://www.novocraft.com) and GSNAP [10] use hash-table indexing methods, whereas BWA [11], SOAP2 [12] and Bowtie2 [13] use Burrows-

Methods

Results

Conclusion


Journal: BMC Genomics	Publication Date: Sep 9, 2019
Citations: 24	License type: open-access


Similar Papers

Validation and assessment of variant calling pipelines for next-generation sequencing.
Mehdi Pirooznia ... Jennifer Parla
Human Genomics | VOL. 8
30 Jul 2014

An analytical workflow for accurate variant discovery in highly divergent regions
Shulan Tian ... Huihuang Yan
BMC Genomics | VOL. 17
02 Sep 2016

INDELseek: detection of complex insertions and deletions from next-generation sequencing data
Chun Hang Au ... Anskar Y H Leung
BMC Genomics | VOL. 18
05 Jan 2017

Defind: Detecting Genomic Deletions by Integrating Read Depth, GC Content, Mapping Quality and Paired-end Mapping Signatures of Next Generation Sequencing Data
Xin Wang ... Huan Zhang
Current Bioinformatics | VOL. 14
07 Jan 2019
