Abstract

Background

Massively parallel sequencing offers enormous potential for expression profiling, in particular for interspecific comparisons. Several platforms for massively parallel sequencing are currently available, which differ in read length and sequencing costs. The 454 technology offers the longest reads. The other sequencing technologies are more cost-effective, at the expense of shorter reads. Reliable expression profiling by massively parallel sequencing depends crucially on the accuracy with which the reads can be mapped to the corresponding genes.

Methodology/Principal Findings

We performed an in silico analysis to evaluate whether incorrect mapping of the sequence reads results in a biased expression pattern. A comparison of six available mapping software tools indicated considerable heterogeneity in mapping speed and accuracy. Independently of the software used to map the reads, we found that for compact genomes both short (35 bp, 50 bp) and long (100 bp) sequence reads result in an almost unbiased expression pattern. In contrast, for species with a larger genome containing more gene families and repetitive DNA, shorter reads (35–50 bp) produced a considerable bias in gene expression. In humans, about 10% of the genes had fewer than 50% of the sequence reads correctly mapped. Sequence polymorphism of up to 9% had almost no effect on the mapping accuracy of 100 bp reads. For 35 bp reads, up to 3% sequence divergence did not strongly affect the mapping accuracy. The effect of indels on the mapping efficiency depends strongly on the mapping software.

Conclusions/Significance

In complex genomes, expression profiling by massively parallel sequencing can introduce a considerable bias due to incorrectly mapped sequence reads if the read length is short. Nevertheless, this bias can be accounted for if the genomic sequence is known. Furthermore, sequence polymorphisms and indels also affect the mapping accuracy and may cause a biased gene expression measurement. The choice of mapping software is critical, and its reliability depends on the presence or absence of indels and on the divergence between the reads and the reference genome. Overall, we found SSAHA2 and CLC to produce the most reliable mapping results.
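The per-gene statistic quoted above (the share of genes with fewer than 50% of their reads correctly mapped) can be illustrated with a minimal Python sketch. It assumes simulated reads whose gene of origin is known, so that each read can be represented as a (true gene, mapped gene) pair. The function names `per_gene_accuracy` and `fraction_of_genes_below` and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def per_gene_accuracy(reads):
    """Return {gene: fraction of its reads mapped back to that gene}.

    `reads` is an iterable of (true_gene, mapped_gene) pairs, where
    mapped_gene is None for a read the mapper could not place.
    This input format is an assumption made for illustration.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for true_gene, mapped_gene in reads:
        total[true_gene] += 1
        if mapped_gene == true_gene:
            correct[true_gene] += 1
    return {gene: correct[gene] / total[gene] for gene in total}


def fraction_of_genes_below(accuracy, threshold=0.5):
    """Return the share of genes whose mapping accuracy is below `threshold`."""
    if not accuracy:
        return 0.0
    below = sum(1 for value in accuracy.values() if value < threshold)
    return below / len(accuracy)


# Toy data (hypothetical): geneA and geneB stand in for two paralogs
# whose reads are sometimes assigned to the wrong family member.
toy_reads = [
    ("geneA", "geneA"),
    ("geneA", "geneA"),
    ("geneA", "geneB"),
    ("geneB", "geneA"),
    ("geneB", None),
    ("geneB", "geneB"),
    ("geneB", "geneA"),
]

accuracy = per_gene_accuracy(toy_reads)
poorly_mapped = fraction_of_genes_below(accuracy, threshold=0.5)
```

In this toy case geneA has two of three reads mapped correctly and geneB one of four, so one of the two genes falls below the 50% threshold. In a real analysis the pairs would come from the output of a mapping tool run on simulated reads.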

Highlights

  • Technological advances have revolutionized the analysis of the transcriptome, the set of genes expressed in a given tissue

  • We investigated how the accuracy of mapping short reads is affected by read length, the complexity of the reference genome, and the mapping algorithm used

  • As quantitative transcript profiling by massively parallel sequencing is potentially affected by the accuracy of the mapping of short reads, we performed an in silico analysis to evaluate this

Summary

Introduction

Technological advances have revolutionized the analysis of the transcriptome, the set of genes expressed in a given tissue. In cDNA microarrays, PCR-amplified cDNA fragments are spotted at high density (10–50 spots per mm²) onto a microscope slide and probed against a labelled target. This technique has the advantage of being rather insensitive to mismatches between the probe and the cDNA sequence. For massively parallel sequencing, several platforms are available, which differ in read length and sequencing costs. Reliable expression profiling by massively parallel sequencing depends crucially on the accuracy with which the reads can be mapped to the corresponding genes.

