Abstract

Most existing statistical methods for calling single nucleotide polymorphisms (SNPs) from next-generation sequencing (NGS) data are based on Bayesian frameworks, and no existing SNP caller produces p-values for calling SNPs in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practice, the parameter involved is very close to the boundary of the parameter space, so standard large-sample theory is not suitable for approximating the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of a point mass at zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters of the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control the false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to have much better control of FDR than the existing SNP callers. Through application to two real datasets, MAFsnp is also shown to outperform the existing SNP callers in terms of calling accuracy. An R package “MAFsnp” implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/.
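
As a rough illustration of the last two steps described above (p-values from a zero-inflated null distribution, then multiple-testing correction), the following Python sketch may help. It is not the MAFsnp implementation. The abstract does not give the form of the two-parameter mixture, so the sketch assumes, purely for illustration, a point mass at zero with weight `w_zero` plus a scaled chi-square distribution with one degree of freedom; the names `mixture_p_value`, `w_zero`, `scale`, and `bh_adjust` are hypothetical. The correction shown is the standard Benjamini-Hochberg procedure, which is one common way to control FDR and is also an assumption here.

```python
from math import erfc, sqrt


def chi2_1_sf(x):
    """Upper-tail probability P(X > x) for a chi-square with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))


def mixture_p_value(stat, w_zero, scale=1.0):
    """p-value under an ASSUMED null: T = 0 with probability w_zero,
    otherwise T = scale * chi-square(1). Illustrative only; the paper's
    actual two-parameter mixture is not specified in the abstract."""
    if stat <= 0.0:
        return 1.0  # every null outcome is at least as extreme as zero
    return (1.0 - w_zero) * chi2_1_sf(stat / scale)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    stats = [0.0, 2.5, 9.8, 15.2]  # hypothetical test statistics, one per locus
    pvals = [mixture_p_value(s, w_zero=0.5) for s in stats]
    adjusted = bh_adjust(pvals)
    called = [q <= 0.05 for q in adjusted]  # loci called as SNPs at FDR 5%
    print(pvals, adjusted, called)
```

Loci whose adjusted p-value falls at or below the chosen level would be called as SNPs; in practice the mixture parameters would be estimated from the data rather than fixed by hand as in this example.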

Highlights

  • The development of next-generation sequencing (NGS) technologies in the past few years has transformed today’s biological science [1]

  • MAFsnp is based on a likelihood function for the NGS read counts from multiple samples. The single nucleotide polymorphism (SNP) calling problem is transformed into a hypothesis test on the minor allele frequency (MAF) at each candidate locus, and an estimated likelihood ratio test statistic is used to detect SNPs

  • Most existing SNP callers are based on Bayesian frameworks, which cannot control the false discovery rate (FDR) at desired nominal levels


Summary

Introduction

The development of next-generation sequencing (NGS) technologies in the past few years has transformed today’s biological science [1]. Being cheap and ultra-high-throughput [2], NGS technologies have been widely applied in many branches of biology [3,4,5,6,7]. Many projects, such as the 1000 Genomes Project [8, 9], the Cancer Genome Atlas Project [10], and the NHLBI Exome Sequencing Project [11], have been carried out to elucidate all forms of human hereditary polymorphism. Single-nucleotide polymorphisms (SNPs) are commonly seen in many conceivable biological processes such as microRNA binding site [12],
PLOS ONE | DOI:10.1371/journal.pone.0135332 August 26, 2015

A Multi-Sample Accurate and Flexible SNP Caller Using NGS Data