Abstract

Background: RNA sequencing (RNA-seq) measures gene expression levels and permits splicing analysis. Many existing aligners are capable of mapping millions of sequencing reads onto a reference genome. For reads that can be mapped to multiple positions along the reference genome (multireads), these aligners may either randomly assign them to a location or discard them altogether. Either choice could bias downstream analyses. Meanwhile, challenges remain in the alignment of reads that span splice junctions. Existing splicing-aware aligners that rely on read counts to identify junction sites are inevitably affected by sequencing depth.

Results: The distance between the aligned positions of paired-end (PE) reads, or between the two parts of a spliced read, depends on the experimental protocol and on gene structure. Here we propose a new method that employs an empirical geometric-tail (GT) distribution of intron lengths to make a rational choice in multiread selection and splice-site detection, according to the aligned distances from PE and spliced reads.

Conclusions: GT models that combine sequence similarity from alignment with the probability given by the length distribution can accurately determine the location of both multireads and spliced reads.

Highlights

  • RNA sequencing (RNA-seq) measures gene expression levels and permits splicing analysis

  • We have proposed a maximum likelihood estimation (MLE) method based on a geometric-tail (GT) distribution of intron lengths to determine the alignment positions of PE reads

  • We used an arbitrarily large tuple up to 3,000 bp to estimate the distribution of genome-wide intron lengths
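
The idea in the highlights above can be illustrated with a minimal sketch. It is not the authors' implementation: the exact form of their GT distribution and of their likelihood is not given in this summary. The sketch assumes one common reading of "geometric tail", namely an empirical histogram for lengths up to a cutoff (3,000 bp here) and a geometric decay beyond it. The class `GeometricTail`, the function `pick_alignment`, the pseudocount `alpha`, and the candidate tuple layout are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter


class GeometricTail:
    """Illustrative geometric-tail model of lengths (an assumed form, not the paper's).

    Lengths from 1 to `cutoff` follow a smoothed empirical histogram;
    lengths above `cutoff` follow a geometric distribution.
    """

    def __init__(self, lengths, cutoff=3000, alpha=1.0):
        if not lengths:
            raise ValueError("need at least one observed length")
        self.cutoff = cutoff
        self.alpha = alpha  # pseudocount so unseen lengths keep nonzero probability
        self.counts = Counter(l for l in lengths if 1 <= l <= cutoff)
        self.n_body = sum(self.counts.values())
        excess = [l - cutoff for l in lengths if l > cutoff]
        # Smoothed share of probability mass assigned to the tail.
        self.tail_mass = (len(excess) + 1) / (len(lengths) + 2)
        if excess:
            # Maximum likelihood estimate of the geometric parameter: 1 / mean excess.
            self.p = min(len(excess) / sum(excess), 1 - 1e-9)
        else:
            # Fallback when no tail observations exist (an arbitrary assumption).
            self.p = 1.0 / cutoff

    def log_prob(self, length):
        """Log-probability of observing a given length under the model."""
        if length < 1:
            return float("-inf")
        if length <= self.cutoff:
            body = (self.counts[length] + self.alpha) / (
                self.n_body + self.alpha * self.cutoff
            )
            return math.log(1 - self.tail_mass) + math.log(body)
        k = length - self.cutoff
        return (
            math.log(self.tail_mass)
            + math.log(self.p)
            + (k - 1) * math.log(1 - self.p)
        )


def pick_alignment(candidates, model):
    """Pick the candidate with the highest combined score.

    Each candidate is (label, alignment_log_score, distance), where distance is
    the gap between PE mates or between the two parts of a spliced read.
    """
    return max(candidates, key=lambda c: c[1] + model.log_prob(c[2]))


if __name__ == "__main__":
    # Toy training lengths, invented purely for demonstration.
    model = GeometricTail([100] * 50 + [200] * 30 + [5000, 7000])
    # Two placements with identical sequence similarity but different distances.
    candidates = [("locus_A", -2.0, 100), ("locus_B", -2.0, 250000)]
    print(pick_alignment(candidates, model)[0])
```

In this toy setting the two candidate placements tie on sequence similarity, so the length model alone breaks the tie in favour of the placement whose distance is better supported by the observed lengths.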

Introduction

RNA sequencing (RNA-seq) measures gene expression levels and permits splicing analysis. Unlike microarrays, RNA-seq has virtually no background signal, and it has no upper limit for the quantification of transcript levels, which correspond to the number of fragments sequenced. Sequences that match multiple locations along the reference genome, however, are handled arbitrarily: existing aligners may either randomly assign these 'multireads' to one of the possible locations or discard them altogether. ERANGE rescues such reads by assigning them in proportion to the uniquely mapped reads [6]. Both approaches might distort the abundance of reads mapped to paralogous gene families and to regions of low sequence complexity or high sequence conservation, thereby affecting virtually all subsequent analyses [7].
