Abstract
Background: The rapid development of next-generation sequencing (NGS) technology has continuously increased the throughput of sequencing data. However, because no available tool is both fast and accurate, analyzing NGS data, especially low-coverage data, remains challenging.
Results: We proposed a decision-tree-based variant calling algorithm. Experiments on a set of real data indicate that our algorithm achieves high accuracy and sensitivity for SNVs and indels and adapts well to low-coverage data. In particular, our algorithm is clearly faster than 3 widely used tools in our experiments.
Conclusions: We implemented our algorithm in a software tool named Fuwa and applied it, together with 4 well-known variant callers (Platypus, GATK-UnifiedGenotyper, GATK-HaplotypeCaller and SAMtools), to three sequencing data sets of the well-studied sample NA12878, which were produced by whole-genome, whole-exome and low-coverage whole-genome sequencing, respectively. We also conducted additional experiments on the WGS data of 4 newly released samples that have not been used to populate dbSNP.
Highlights
The rapid development of next-generation sequencing (NGS) technology has continuously increased the throughput of sequencing data.
Overview of Fuwa: Fuwa accepts single-sample alignment data in Binary Sequence Alignment/Mapping (BAM) format and outputs calls for single nucleotide variants (SNVs) and short indels in Variant Call Format (VCF) [10].
We started from HiSeq whole-genome sequencing (WGS) data (75~86×, 101-bp paired-end), exome-capture data and low-coverage (~4×) whole-genome sequencing data, conducted read alignment with BWA, and applied preprocessing steps including duplicate removal, local realignment and base quality recalibration before the calling step.
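To make the input/output contract above concrete, the following is a minimal sketch of a decision-rule genotype call at one site, formatted as a VCF data line. It is not Fuwa's algorithm: the decision tree itself is not described in this section, so the `SiteCounts` type, the function names and all thresholds (`min_depth`, `min_alt_frac`, `hom_alt_frac`) are hypothetical and chosen only for illustration. Allele counts are assumed to have been extracted from the BAM file already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteCounts:
    """Allele counts at one site (hypothetical type, for illustration only)."""
    chrom: str
    pos: int        # 1-based position, as used in VCF
    ref: str
    alt: str
    ref_count: int  # reads supporting the reference allele
    alt_count: int  # reads supporting the alternate allele


def call_genotype(site: SiteCounts,
                  min_depth: int = 4,
                  min_alt_frac: float = 0.2,
                  hom_alt_frac: float = 0.8) -> Optional[str]:
    """Apply a tiny chain of threshold decisions (assumed values, not Fuwa's)."""
    depth = site.ref_count + site.alt_count
    if depth < min_depth:
        return None                      # too little evidence to call
    alt_frac = site.alt_count / depth
    if alt_frac < min_alt_frac:
        return None                      # looks homozygous reference: no variant
    if alt_frac >= hom_alt_frac:
        return "1/1"                     # homozygous alternate
    return "0/1"                         # heterozygous


def to_vcf_line(site: SiteCounts, genotype: str) -> str:
    """Format one call using the fixed VCF columns plus a single sample column."""
    depth = site.ref_count + site.alt_count
    fields = [
        site.chrom, str(site.pos), ".",  # CHROM, POS, ID (missing)
        site.ref, site.alt,              # REF, ALT
        ".", ".",                        # QUAL and FILTER left as missing
        f"DP={depth}",                   # INFO: total depth
        "GT", genotype,                  # FORMAT and sample genotype
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    site = SiteCounts("chr1", 100, "A", "G", ref_count=5, alt_count=5)
    gt = call_genotype(site)
    if gt is not None:
        print(to_vcf_line(site, gt))
```

A real caller would also weigh base and mapping qualities and would handle indels, which this sketch ignores.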
Summary
The rapid development of next-generation sequencing (NGS) technology has continuously increased the throughput of sequencing data. Because no available tool is both fast and accurate, analyzing NGS data, especially low-coverage data, remains challenging. In recent years, NGS technologies have made great progress in both improving throughput and lowering cost. NGS technology can complete a whole-genome sequencing task in a single day for merely one thousand dollars [1]. The massive data sets generated by NGS in research projects such as 1000 Genomes are counted in terabases [2], and it is predicted that within the decade approximately one hundred million to two billion human genomes will be sequenced [1]. The quality of call sets directly affects downstream analysis such as disease-causing gene detection.