SNP Calling Research Articles

Background

Genome-wide association studies (GWAS) have identified many common polymorphisms associated with complex traits. However, these common variants explain only a small fraction of the phenotypic variance, leaving a substantial portion of the heritability unexplained. As a result, the search for "missing" heritability is drawing increasing attention, particularly through rare variant studies, which often require large sample sizes and thus extensive sequencing effort. Although next generation sequencing (NGS) technologies have made it possible to sequence a large number of reads economically and efficiently, it is still often cost prohibitive to sequence the thousands of individuals generally required for association studies. A more efficient and cost-effective design pools the genetic material of multiple individuals and sequences the pools instead of the individuals. Pooled sequencing has made association studies of rare variants more feasible, but it also poses a great challenge for data analysis, essentially because individual sample identity is lost and NGS sequencing errors can be hard to distinguish from low frequency alleles.

Results

A unified approach for estimating minor allele frequency, SNP calling and association studies based on pooled sequencing data using an expectation maximization (EM) algorithm is developed in this paper. This approach makes it possible to study the effects of minor allele frequency, sequencing error rate, number of pools, number of individuals in each pool, and sequencing depth on the accuracy of minor allele frequency estimates. We show that the naive method of estimating minor allele frequencies by taking the fraction of observed minor alleles can be significantly biased, especially for rare variants. In contrast, our EM approach gives an unbiased estimate of the minor allele frequency under all scenarios studied in this paper. A SNP calling approach for pooled sequencing data based on the EM algorithm, EM-SNP, is then developed and compared with another recent SNP calling method, SNVer. We show that EM-SNP outperforms SNVer in terms of the fraction of db-SNPs among the called SNPs, as well as the transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to study the association between variants and type I diabetes.

Conclusions

The EM-based approach for the analysis of pooled sequencing data can accurately estimate minor allele frequencies, call SNPs, and find associations between variants and complex traits. This approach is especially useful for studies involving rare variants.
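The bias of the naive estimator mentioned above can be illustrated with a minimal sketch. It is not the EM-SNP algorithm from the paper: it assumes a single biallelic site, a known symmetric sequencing error rate, and reads drawn independently from one pool, and it ignores the number of pools and the number of individuals per pool. The function names and the default error rate are hypothetical.

```python
def naive_maf(minor_reads, total_reads):
    """Naive estimate: fraction of reads showing the minor allele."""
    return minor_reads / total_reads


def em_maf(minor_reads, total_reads, error=0.01, tol=1e-12, max_iter=1000):
    """EM estimate of the minor allele frequency f under a toy model.

    Assumption (not from the paper): each read truly carries the minor
    allele with probability f, and is misread as the other allele with
    probability `error`, where 0 < error < 0.5.
    """
    major_reads = total_reads - minor_reads
    f = 0.5  # arbitrary starting value
    for _ in range(max_iter):
        # E-step: probability that a read truly carries the minor allele,
        # given the allele that was observed.
        w_minor = f * (1 - error) / (f * (1 - error) + (1 - f) * error)
        w_major = f * error / (f * error + (1 - f) * (1 - error))
        # M-step: expected fraction of reads that truly carry the minor allele.
        f_new = (minor_reads * w_minor + major_reads * w_major) / total_reads
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f
```

With 30 minor-allele reads out of 1000 and an assumed error rate of 1%, `naive_maf` returns 0.03, while `em_maf` converges to roughly 0.0204, the value obtained by subtracting the expected error contribution. The relative gap widens as the true frequency approaches the error rate, which is consistent with the abstract's point about rare variants.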

Background

Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators even have lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may matter much more than in individual SNP calling: while their impact in individual sequencing can be reduced by requiring a minimum number of reads per allele, this would have a strong and undesired effect in pools, because alleles at low frequency in the pool are unlikely to be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g., cancer tissues), but this is not true in pools: under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f, and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool; conversely, a true singleton can yield more than one read or, more likely, none.

Results

To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools, and implemented it in a program called snape. The approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.

Conclusions

We present software that helps in calling SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. To test the behaviour of our software, we generated artificial genomes through simulated coalescence and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
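The combination of a 1/f prior with a sequencing error model described above can be sketched as follows. This is an illustration under stated assumptions, not snape's actual model or interface: it assumes a pool of `n_chrom` chromosomes carrying k copies of the alternative allele, a prior proportional to 1/k for k ≥ 1 with an assumed scaling `theta`, a symmetric error rate, and binomial sampling of reads. All function names, parameters and defaults are hypothetical.

```python
import math


def posterior_allele_count(alt_reads, total_reads, n_chrom,
                           error=0.01, theta=0.01):
    """Posterior over k, the number of alternative alleles in the pool.

    Assumptions (illustrative only): k ranges over 0..n_chrom-1, the prior
    is theta/k for k >= 1 with the remaining mass on k = 0, 0 < error < 0.5,
    and theta is small enough that the prior mass on k = 0 is positive.
    """
    harmonic = sum(1.0 / k for k in range(1, n_chrom))
    log_post = []
    for k in range(n_chrom):
        prior = theta / k if k > 0 else 1.0 - theta * harmonic
        f = k / n_chrom
        # Probability that a single read shows the alternative allele.
        q = f * (1 - error) + (1 - f) * error
        # Binomial log-likelihood; the binomial coefficient is constant
        # in k and cancels on normalisation.
        log_lik = (alt_reads * math.log(q)
                   + (total_reads - alt_reads) * math.log(1 - q))
        log_post.append(math.log(prior) + log_lik)
    # Normalise in log space to avoid underflow.
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def prob_segregating(posterior):
    """Posterior probability that the site is polymorphic in the pool."""
    return 1.0 - posterior[0]


def posterior_mean_frequency(posterior):
    """Posterior mean of the allele frequency f = k / n_chrom."""
    n_chrom = len(posterior)
    return sum(k / n_chrom * p for k, p in enumerate(posterior))
```

For example, `prob_segregating(posterior_allele_count(20, 100, 50))` is close to 1, whereas zero alternative reads out of 100 pushes the probability below its prior value. Working with the full posterior rather than a fixed minimum read count is what lets a low-frequency allele be called without an arbitrary threshold.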

Articles published on SNP Calling

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Development of strategies for SNP detection in RNA-seq data: application to lymphoblastoid cell lines and evaluation using 1000 Genomes data.

Exome sequencing of tumors: relevance in copy-number alteration (CNA) analysis and fixed tissue samples.

A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing

AMY-tree: an algorithm to use whole genome SNP calling for Y chromosomal phylogenetic applications

An Improved Genotyping by Sequencing (GBS) Approach Offering Increased Versatility and Efficiency of SNP Discovery and Genotyping

A unified approach for allele frequency estimation, SNP detection and association studies based on pooled sequencing data using EM algorithms

Multiplex DNA amplification and barcoding in a single reaction for 454 Roche sequencing: A comprehensive study on the control region of the mitochondrial genome

SNP Design from 454 Sequencing of Podosphaera plantaginis Transcriptome Reveals a Genetically Diverse Pathogen Metapopulation with High Levels of Mixed-Genotype Infection

Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data

Detection of FLT3 Internal Tandem Duplication in Targeted, Short-Read-Length, Next-Generation Sequencing Data

Library Preparation and Multiplex Capture for Massive Parallel Sequencing Applications Made Efficient and Easy

SNP calling by sequencing pooled samples

Genotyping in the cloud with Crossbow.

G.O.4 Search for SNPs modifiers in DMD with different corticosteroids response by candidate genes targeted resequencing

Genome-Wide Somatic Copy Number Alterations in Low-Grade PanINs and IPMNs from Individuals with a Family History of Pancreatic Cancer

SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data.

High-throughput genotyping in citrus accessions using an SNP genotyping array

AdapterRemoval: easy cleaning of next-generation sequencing reads.

Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions
