Automated SNP detection from a large collection of white spruce expressed sequences: contributing factors and approaches for the categorization of SNPs.

Nathalie Pavy, Charles Paule, Jean Bousquet, John Mackay, Lee S Parsons

doi:10.1186/1471-2164-7-174

Open Access

https://doi.org/10.1186/1471-2164-7-174

Journal: BMC Genomics
Publication Date: Jul 6, 2006
Citations: 104
License type: CC BY 2.0

Affiliation: Université Laval, University of Minnesota

Abstract

Background: High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies. As a first step toward this goal, we aimed to develop a resource of candidate Single Nucleotide Polymorphisms (SNPs) in white spruce (Picea glauca [Moench] Voss), a softwood tree of major economic importance.

Results: A white spruce SNP resource encompassing 12,264 SNPs was constructed from a set of 6,459 contigs derived from Expressed Sequence Tags (ESTs) by using the Bayesian statistical software PolyBayes. Several parameters influencing the SNP prediction were analysed, including the a priori expected polymorphism, the probability score (PSNP), and the contig depth and length. SNP detection in 3' and 5' reads from the same clones revealed a level of inconsistency between overlapping sequences as low as 1%. A subset of 245 predicted SNPs was verified through the independent resequencing of genomic DNA of a genotype also used to prepare cDNA libraries. The validation rate reached a maximum of 85% for SNPs predicted with either PSNP ≥ 0.95 or ≥ 0.99. A total of 9,310 SNPs were detected by using PSNP ≥ 0.95 as a criterion. The SNPs were distributed among 3,590 contigs encompassing an array of broad functional categories, with an overall frequency of 1 SNP per 700 nucleotide sites. Experimental and statistical approaches were used to evaluate the proportion of paralogous SNPs, with estimates in the range of 8 to 12%. The 3,789 coding SNPs identified through coding region annotation and ORF prediction were distributed into 39% nonsynonymous and 61% synonymous substitutions. Overall, there were 0.9 SNPs per 1,000 nonsynonymous sites and 5.2 SNPs per 1,000 synonymous sites, for a genome-wide nonsynonymous to synonymous substitution rate ratio (Ka/Ks) of 0.17.

Conclusion: We integrated the SNP data in the ForestTreeDB database along with functional annotations to provide a tool facilitating the choice of candidate genes for mapping purposes or association studies.
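The Ka/Ks value quoted in the abstract can be checked directly from the two SNP densities it reports. The short sketch below only reproduces that arithmetic; the function name `rate_ratio` is a hypothetical helper introduced for illustration, and no data beyond the two published densities are assumed.

```python
# SNP densities quoted in the abstract, both expressed per 1,000 sites.
NONSYN_PER_1000 = 0.9  # SNPs per 1,000 nonsynonymous sites
SYN_PER_1000 = 5.2     # SNPs per 1,000 synonymous sites


def rate_ratio(ka: float, ks: float) -> float:
    """Return Ka/Ks for two densities given in the same unit (hypothetical helper)."""
    if ks <= 0:
        raise ValueError("Ks must be positive")
    return ka / ks


ratio = rate_ratio(NONSYN_PER_1000, SYN_PER_1000)
print(f"Ka/Ks = {ratio:.2f}")  # prints "Ka/Ks = 0.17"
```

Because both densities use the same denominator (1,000 sites), the unit cancels and the ratio is dimensionless.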

Highlights

High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies.
PolyBayes uses a priori information about the average pairwise difference between paralogous sequences to calculate the a posteriori probability that a sequence is native, by comparison to a reference sequence from the Expressed Sequence Tag (EST) cluster.
Using the PolyBayes software, we estimated Single Nucleotide Polymorphism (SNP) diversity and distribution parameters in 6,459 contigs, each derived from sequences of at least two cDNA clones.

Summary

Introduction

High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies. Bayesian statistics were applied to incorporate background information into the specification of a tested model for data analysis [14]. They were implemented in the software PolyBayes to determine a confidence score for each SNP detected in a cluster of ESTs [15]. Based on the alignment of the ESTs within the cluster, another Bayesian calculation generates the probability that a variant at a given location of a multiple alignment represents a true polymorphism as opposed to a sequencing error. This calculation takes into account the alignment depth, the base calls in each of the sequences, the associated base quality values, the base composition in the region, and the expected a priori rate of polymorphism of the species under investigation. This approach was shown to be adequate for SNP prediction in human [15], sugarcane [16], soybean [17], and pine [18].
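To make the idea of a Bayesian SNP score concrete, the sketch below scores a single alignment column from its base calls and Phred quality values. It is a deliberately simplified two-hypothesis toy model, not the PolyBayes algorithm: it assumes uniform base composition, at most two alleles sampled with equal probability, and independent reads. The function names and the default prior of 0.003 are illustrative assumptions, not values taken from the paper.

```python
from itertools import combinations
from math import prod

BASES = "ACGT"


def phred_to_error(q: float) -> float:
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def base_likelihood(observed: str, true_base: str, q: float) -> float:
    """P(observed call | true base), spreading errors evenly over the 3 other bases."""
    e = phred_to_error(q)
    return 1 - e if observed == true_base else e / 3


def snp_posterior(calls, prior_snp: float = 0.003) -> float:
    """Posterior probability that a column is polymorphic (toy model).

    calls: list of (base, phred_quality) pairs, one per aligned read.
    prior_snp: assumed a priori polymorphism rate (illustrative value).
    """
    # Monomorphic hypothesis: one true base shared by all reads,
    # so every mismatch must be a sequencing error.
    l_mono = sum(
        prod(base_likelihood(b, t, q) for b, q in calls) for t in BASES
    ) / len(BASES)

    # Polymorphic hypothesis: two distinct alleles, each read drawn
    # from either allele with probability 1/2.
    pairs = list(combinations(BASES, 2))
    l_poly = sum(
        prod(
            0.5 * base_likelihood(b, x, q) + 0.5 * base_likelihood(b, y, q)
            for b, q in calls
        )
        for x, y in pairs
    ) / len(pairs)

    # Bayes' rule over the two hypotheses.
    numerator = prior_snp * l_poly
    return numerator / (numerator + (1 - prior_snp) * l_mono)


balanced = snp_posterior([("A", 40)] * 4 + [("G", 40)] * 4)
uniform = snp_posterior([("A", 40)] * 8)
noisy = snp_posterior([("A", 40)] * 7 + [("G", 10)])
print(balanced, uniform, noisy)
```

In this toy model, a column with two well-supported, high-quality alleles scores close to 1, a column where all reads agree scores close to 0, and a single low-quality mismatch is attributed to sequencing error. This mirrors qualitatively how depth, base calls, quality values, and the prior enter the calculation described above.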
