Probabilistic base calling of Solexa sequencing data

Jacques Rougemont,Laurent Farinelli,Felix Naef,Christian Iseli,Ioannis Xenarios,Arnaud Amzallag

doi:10.1186/1471-2105-9-431

Abstract

Background: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology.

Results: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.

Conclusion: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
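
The sub-tag selection mentioned in the Results can be illustrated with a small sketch. The abstract does not give the actual score, so everything here is an assumption made for illustration: the per-position information content is taken as 2 bits minus the Shannon entropy of the four base probabilities, the hypothetical `cutoff` parameter sets how much information a position must carry to be worth keeping, and the best contiguous sub-tag is found with a maximum-sum scan. The function names are likewise hypothetical and are not taken from the authors' R package.

```python
import math

def information_bits(probs):
    """Information content of one position: 2 bits minus the Shannon entropy
    of its four base probabilities (assumed definition, for illustration)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy

def best_subtag(read_probs, cutoff=1.0):
    """Return (start, end) of the contiguous sub-tag, end exclusive, that
    maximizes the sum of (information - cutoff) over its positions.
    Returns (0, 0) if no position exceeds the cutoff."""
    best_score, best_span = 0.0, (0, 0)
    run_score, run_start = 0.0, 0
    for i, probs in enumerate(read_probs):
        gain = information_bits(probs) - cutoff
        if run_score <= 0:
            # A non-positive running total cannot help: restart the window here.
            run_score, run_start = gain, i
        else:
            run_score += gain
        if run_score > best_score:
            best_score, best_span = run_score, (run_start, i + 1)
    return best_span

# Toy read: five confident positions followed by three uncertain ones.
confident = [0.97, 0.01, 0.01, 0.01]
noisy = [0.40, 0.30, 0.20, 0.10]
read = [confident] * 5 + [noisy] * 3
start, end = best_subtag(read)
print(f"keep positions {start}..{end - 1}")
```

With these toy probabilities the uncertain tail is trimmed and only the confident prefix is kept; a different `cutoff` would trade tag length against certainty.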

Highlights

Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags by parallel sequencing-by-synthesis of DNA colonies
Synthesis efficiency is limited and within each colony, some DNA strands incorporate a non-complementary base or are de-synchronized because they failed to incorporate a nucleotide at a previous step
Cokus et al. [12] use Solexa's pre-treated data (_sig2 files) and apply a very similar EM procedure to fit a Gaussian mixture model for probabilistic base calling. They do not use information-based metrics to reduce the probabilities to IUPAC codes; instead they construct position-weight matrices with which they scan the reference genome, which is computationally expensive and not directly applicable to de novo sequencing
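
To make the idea of reducing base probabilities to IUPAC codes concrete, here is a minimal sketch. The symbol table is the standard IUPAC nucleotide code, but the selection rule is an assumption made for illustration and is not the criterion used in the paper: bases are added in order of decreasing probability until their combined probability reaches a hypothetical threshold `min_mass`, and the resulting set is mapped to its symbol. The helper `entropy_bits` is included only to show how the uncertainty of a call could be measured.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def entropy_bits(probs):
    """Shannon entropy (in bits) of a dict mapping base -> probability."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def iupac_call(probs, min_mass=0.9):
    """Pick the smallest set of most probable bases whose total probability
    reaches min_mass (an assumed rule), and return its IUPAC symbol."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, mass = [], 0.0
    for base in ranked:
        chosen.append(base)
        mass += probs[base]
        if mass >= min_mass:
            break
    return IUPAC[frozenset(chosen)]

clear = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
purine = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}
flat = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
for p in (clear, purine, flat):
    print(iupac_call(p), round(entropy_bits(p), 3))
```

Unlike scanning a genome with position-weight matrices, a tag coded this way stays a plain string over the IUPAC alphabet, so it can be passed to any aligner that accepts ambiguity codes.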

Summary

Introduction

Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. Ultra-high-throughput sequencing is having a growing impact on biological research by providing fast, high-resolution access to genome-scale information. While sample processing is relatively streamlined, innovations in data management and information processing are necessary to exploit the full potential of the technology. Developing new algorithms to extract more information from available images and reduce the number of sequencing runs per project will prove extremely

Methods

Results

Discussion

Conclusion


Journal: BMC Bioinformatics	Publication Date: Oct 13, 2008
Citations: 171	License type: cc-by
