Abstract
With the rapid development of short-read sequencing technologies, many population-scale resequencing studies have been carried out in recent years to study the associations between human genome variants and various phenotypes. Variant calling, which aims to comprehensively discover genomic variants in sequenced samples, is one of the core bioinformatics tasks in such studies. Many efforts have been made to develop short read-based variant calling approaches; however, state-of-the-art tools are still computationally expensive. Meanwhile, cutting-edge genomics studies place higher requirements on the yield of variant calling. Herein, we propose the Partial-Order Alignment-based single nucleotide variant (SNV) and Indel caller (Psi-caller), a lightweight variant calling algorithm that simultaneously achieves high performance and yield. Psi-caller recognizes candidate variant sites and divides them into three categories according to the complexity and location of their signatures. It then employs methods including a binomial model, partial-order alignment, and de Bruijn graph-based local assembly to handle the respective categories of candidate sites and to call and genotype SNVs/Indels. Benchmarks on simulated and real short-read sequencing data sets demonstrate that Psi-caller is multiple times faster than state-of-the-art tools with higher or equal sensitivity and accuracy. It has the potential to handle large-scale data sets well in cutting-edge genomics studies.
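To make the binomial-model idea mentioned in the abstract more concrete, the following is a minimal sketch of diploid genotyping at a single site from read counts. It is not Psi-caller's actual model: the fixed error rate, the three genotype hypotheses, and the function names `binomial_pmf`, `genotype_likelihoods`, and `call_genotype` are all assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of observing k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def genotype_likelihoods(alt_reads: int, depth: int, error_rate: float = 0.01) -> dict:
    """Likelihood of the observed alt-read count under each diploid genotype.

    error_rate is a hypothetical per-base sequencing error rate, not a value
    taken from the paper.
    """
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")
    # Expected fraction of alt-supporting reads under each genotype (assumed).
    expected_alt_fraction = {
        "0/0": error_rate,      # homozygous reference: alt reads are errors
        "0/1": 0.5,             # heterozygous: about half the reads carry alt
        "1/1": 1 - error_rate,  # homozygous alt: ref reads are errors
    }
    return {
        genotype: binomial_pmf(alt_reads, depth, p)
        for genotype, p in expected_alt_fraction.items()
    }


def call_genotype(alt_reads: int, depth: int, error_rate: float = 0.01) -> str:
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(alt_reads, depth, error_rate)
    return max(likelihoods, key=likelihoods.get)
```

For example, `call_genotype(14, 30)` compares how well 14 alt-supporting reads out of 30 fit each hypothesis. A production caller would typically also weigh base and mapping qualities and genotype priors, which this sketch omits.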
Highlights
High-throughput sequencing (HTS) has become a fundamental approach to characterize human genomes (Lander et al, 2001; Shendure et al, 2019)
(1) Task splitting: Psi-caller splits the reference genome into fixed-size blocks to create a number of variant calling subtasks
Local genomic regions with variant signatures are recognized as candidates and categorized into three classes according to their positions and the ratio of supporting reads, i.e., high-confidence candidates, low-confidence candidates, and candidates in tandem repeat regions and low-complexity regions
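The two steps described above, task splitting and candidate categorization, can be sketched as follows. This is an illustration under stated assumptions rather than Psi-caller's implementation: the `Candidate` fields, the `high_ratio` threshold of 0.3, the category labels, and the choice to check the repeat/low-complexity flag first are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


def split_into_blocks(contig_length: int, block_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) intervals that tile a contig.

    Each interval would correspond to one variant calling subtask.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, contig_length, block_size):
        # The last block is truncated at the end of the contig.
        yield start, min(start + block_size, contig_length)


@dataclass
class Candidate:
    position: int
    supporting_reads: int  # reads carrying the variant signature
    depth: int             # total reads covering the site
    in_repeat_or_low_complexity: bool  # assumed to come from an annotation step


def categorize(candidate: Candidate, high_ratio: float = 0.3) -> str:
    """Assign a candidate to one of three classes.

    high_ratio is a made-up threshold on the supporting-read ratio.
    """
    # Assumption: location in a repeat/low-complexity region takes precedence.
    if candidate.in_repeat_or_low_complexity:
        return "repeat_or_low_complexity"
    ratio = candidate.supporting_reads / candidate.depth if candidate.depth else 0.0
    return "high_confidence" if ratio >= high_ratio else "low_confidence"
```

Because the blocks are independent, each `(start, end)` interval could be handed to a separate worker, which is the usual motivation for this kind of task splitting.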
Summary
High-throughput sequencing (HTS) has become a fundamental approach to characterize human genomes (Lander et al, 2001; Shendure et al, 2019). Single nucleotide variants (SNVs) and short insertions/deletions (Indels) are genomic alterations that usually involve changes of fewer than 50 base pairs (bp), in contrast to structural variants. Long reads often suffer from high error rates, including substitutions, small insertions, and deletions (Roberts et al, 2013; Jain et al, 2016), so it remains non-trivial for long read-based callers to distinguish genuine variants from sequencing errors. Most short-read sequencing platforms, by contrast, rarely produce Indel errors and achieve high base accuracy (>99%), which has proven more helpful for SNV/Indel calling in several large-population genome projects (Auton et al, 2015; Wu et al, 2019).