Abstract
High-throughput sequencing technologies have developed rapidly in recent years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Because most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, comparable benchmarks are lacking in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, combining five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. The resulting variant sets were evaluated on several metrics, including sensitivity and specificity. We found that all investigated tools are suitable for the analysis of next-generation sequencing (NGS) data in plant research. Across the performance metrics considered, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
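To make the two named metrics concrete, the following is a minimal sketch using the standard textbook definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The abstract does not state how the study derived these counts, so the function names, the representation of a variant as a (chromosome, position) tuple, and the toy data are all assumptions made for illustration only.

```python
# Illustrative only: standard definitions of sensitivity and specificity,
# not necessarily the exact procedure used in the study.

def confusion_counts(called, truth, candidates):
    """Count TP, FP, FN, TN for one pipeline.

    called     -- set of variants reported by the pipeline
    truth      -- set of validated (true) variants
    candidates -- set of all sites that were assessed
    """
    tp = len(called & truth)               # reported and true
    fp = len(called - truth)               # reported but not true
    fn = len(truth - called)               # true but missed
    tn = len(candidates - called - truth)  # correctly not reported
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of true variants that were recovered."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Hypothetical toy data: ten assessed sites on one chromosome.
candidates = {("Chr1", pos) for pos in range(1, 11)}
truth = {("Chr1", 1), ("Chr1", 2), ("Chr1", 3), ("Chr1", 4)}
called = {("Chr1", 2), ("Chr1", 3), ("Chr1", 4), ("Chr1", 5)}

tp, fp, fn, tn = confusion_counts(called, truth, candidates)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this toy case the pipeline recovers three of four true variants and makes one false call among six non-variant sites.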
Highlights
As the basis of biological properties and heredity, the genome of a species is a valuable resource for numerous studies
Variations within the A. thaliana population were studied in the 1001 genomes project [1]
Our study evaluated the performance of 50 variant calling pipelines, combining five read mappers (Bowtie2, BWA-MEM, CLC-mapper, GEM3, Novoalign) with ten variant callers from eight tools (SAMtools/BCFtools, CLC-caller, FreeBayes, GATK v3.8/v4.0/v4.1, LoFreq, SNVer, VarDict, VarScan; the three GATK versions are counted separately) that are frequently applied in re-sequencing studies
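The figure of 50 pipelines follows from pairing each of the five read mappers with each variant caller, with the three GATK versions counted as separate callers. A minimal sketch of that enumeration is shown below; the tool names come from the text, while the variable and function names are hypothetical and no tool is actually invoked.

```python
from itertools import product

# Tool names as listed in the text; GATK versions are treated as separate callers.
MAPPERS = ["Bowtie2", "BWA-MEM", "CLC-mapper", "GEM3", "Novoalign"]
CALLERS = [
    "SAMtools/BCFtools", "CLC-caller", "FreeBayes",
    "GATK v3.8", "GATK v4.0", "GATK v4.1",
    "LoFreq", "SNVer", "VarDict", "VarScan",
]


def enumerate_pipelines(mappers, callers):
    """Return every (mapper, caller) pair as one candidate pipeline."""
    return list(product(mappers, callers))


pipelines = enumerate_pipelines(MAPPERS, CALLERS)
print(len(pipelines))  # 5 mappers x 10 callers = 50
for mapper, caller in pipelines[:3]:
    print(f"{mapper} -> {caller}")
```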
Summary
As the basis of biological properties and heredity, the genome of a species is a valuable resource for numerous studies. Subtle genetic differences between individuals of the same species are of academic and economic interest because they determine phenotypic differences. Falling sequencing costs have boosted high-throughput sequencing projects, facilitating the analysis of this genetic diversity. Variations within the A. thaliana population were studied in the 1001 genomes project [1]. As the number of high-quality reference genome sequences rises continuously, the number of re-sequencing projects increases as well [2]. There are pangenome projects for various species that focus on genome evolution [3,4,5] and mapping-by-sequencing projects that focus on agronomically important traits of crops [6,7,8,9].