High-throughput identification, database storage and analysis of SNPs in EST sequences.

F J Useche, M Harafey, A Rafalski, G Gao

doi:10.11234/gi1990.12.194


https://doi.org/10.11234/gi1990.12.194


Journal: Genome Informatics	Publication Date: Jan 1, 2001
Citations: 48	License type: free

Affiliation: University of Delaware

Abstract

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA variation and of disease-causing mutations in many genes. Because of their abundance and low per-generation mutation rate, they are thought to be the next generation of genetic markers, usable in a wide range of biological, genetic, pharmacological, and medical applications. There are several strategies, both experimental and in-silico, for SNP discovery and mapping. Experimental SNP discovery consists of a number of laborious steps that make the process complex and expensive. In-silico discovery has been proposed as an alternative that takes advantage of large data sets that contain potential SNP information but were generated for other purposes and have not yet been used as a source of SNP information. However, to apply the in-silico method successfully to large data sets, the following challenges need to be addressed. First, it is necessary to build an integrated SNP pipeline that handles the data processing steps smoothly from beginning (collecting sequence information) to end (SNPs in the database). Second, the parameters of the SNP detection tool have to be optimized to satisfy the specific goals of the project. Finally, SNP data cannot be fully used until the in-silico method is validated experimentally. In this paper we present the design and implementation of an in-silico SNP detection software pipeline that exploits the existence of large EST (expressed sequence tag) data sets and effectively addresses the above challenges. First, the pipeline allows for smooth data transition between its components by implementing data interfaces that translate between the data formats of the tools used at the different stages. Second, we optimized PolyBayes parameters for SNP detection in maize ESTs.
Finally, we implemented a user interface that, together with the database structure we created, allows the scientist to perform preliminary analysis and basic statistics on the SNP data prior to experimental validation. The pipeline works with two different sequence assemblers: PHRAP (http://www.phrap.org/) and CAT from DoubleTwist (http://www.doubletwist.com/). It uses a Bayesian engine (PolyBayes) for SNP detection and selects the relevant polymorphism information, which is then uploaded into a database. We detected 2439 SNPs and 822 insertions/deletions (INDELs) with a PolyBayes probability higher than 0.99 in the public set of 68,000 maize ESTs. The user interface allowed us to analyze the polymorphism information right after discovery in several ways, giving us insight into the distribution and significance of the newly acquired data.
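
The last stages described in the abstract (keeping candidates above a probability cutoff, loading them into a database, and computing basic statistics) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` record, the `polymorphism` table layout, and the function names are all hypothetical, and the candidates are assumed to have already been parsed from the detector's output. Only the 0.99 cutoff and the SNP/INDEL distinction come from the abstract.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one polymorphism reported by the detector."""
    contig: str         # assumed identifier of the assembled contig
    position: int       # assumed position within the contig alignment
    kind: str           # "SNP" or "INDEL"
    alleles: str        # e.g. "A/G" or "-/T"
    probability: float  # probability assigned by the detection engine


def filter_candidates(candidates, min_probability=0.99):
    """Keep candidates whose probability is strictly higher than the cutoff."""
    return [c for c in candidates if c.probability > min_probability]


def store(conn, candidates):
    """Load the selected candidates into an illustrative single-table schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS polymorphism ("
        "contig TEXT, position INTEGER, kind TEXT, "
        "alleles TEXT, probability REAL)"
    )
    conn.executemany(
        "INSERT INTO polymorphism VALUES (?, ?, ?, ?, ?)",
        [(c.contig, c.position, c.kind, c.alleles, c.probability)
         for c in candidates],
    )
    conn.commit()


def counts_by_kind(conn):
    """Basic statistic: number of stored polymorphisms per kind."""
    rows = conn.execute(
        "SELECT kind, COUNT(*) FROM polymorphism GROUP BY kind"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    # Made-up example data, not results from the paper.
    example = [
        Candidate("contig1", 10, "SNP", "A/G", 0.999),
        Candidate("contig1", 20, "SNP", "C/T", 0.50),
        Candidate("contig2", 5, "INDEL", "-/T", 0.995),
    ]
    connection = sqlite3.connect(":memory:")
    store(connection, filter_candidates(example))
    print(counts_by_kind(connection))
```

An in-memory SQLite database keeps the sketch self-contained; a production pipeline of the kind the abstract describes would presumably use a persistent database and a richer schema linking polymorphisms to contigs and individual ESTs.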


