Abstract

As next-generation sequencing technologies become more efficient and less expensive, RNA-Seq is becoming a widely used technique for transcriptome studies. Computational analysis of RNA-Seq data often starts with the mapping of millions of short reads back to the genome or transcriptome, a process in which some reads are found to map equally well to multiple genomic locations (multimapping reads). We have developed the Minimum Unique Length Tool (MULTo), a framework for efficient and comprehensive representation of mappability information, through identification of the shortest possible length required for each genomic coordinate to become unique in the genome and transcriptome. Using the minimum unique length information, we have compared different uniqueness compensation approaches for transcript expression level quantification and demonstrate that the best compensation is achieved by discarding multimapping reads and correctly adjusting gene model lengths. We have also explored uniqueness within specific regions of the mouse genome and enhancer mapping experiments. Finally, by making MULTo available to the community we hope to facilitate the use of uniqueness compensation in RNA-Seq analysis and to eliminate the need to make additional mappability files.

Highlights

  • The use of next-generation sequencing-based methods has increased enormously in the last couple of years

  • Comprehensive uniqueness representation using the minimum unique length: We reasoned that instead of storing whether a read of predetermined length is unique at a given genomic coordinate, it would be more efficient to store the minimum length required for each genomic coordinate to be uniquely mappable, which we call the minimum unique length (MUL)

  • When estimating absolute expression levels within a sample, it is important that the length normalization faithfully reflects the length of the expressed transcript isoform [17] and that uniqueness, i.e. the presence of multimapping reads, is carefully compensated for
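The minimum unique length idea and the length adjustment described in the highlights above can be illustrated with a small Python sketch. This is not MULTo's implementation: it is a brute-force toy that assumes exact matching on a single strand, and the function names (`minimum_unique_lengths`, `uniquely_mappable_length`, `length_adjusted_rpkm`) as well as the RPKM-style formula are illustrative assumptions rather than anything taken from the tool. It relies on one observation: if the sequence of length k starting at a coordinate is unique, then every longer read starting there is unique too, so a single stored value per coordinate answers the uniqueness question for any read length.

```python
from collections import Counter
from typing import List, Optional


def minimum_unique_lengths(genome: str, max_len: int) -> List[Optional[int]]:
    """For each start coordinate, return the smallest k <= max_len such that
    genome[i:i+k] occurs exactly once in the genome, or None if no such k exists.

    Toy assumptions: exact matches only, forward strand only, brute force.
    """
    n = len(genome)
    mul: List[Optional[int]] = [None] * n
    unresolved = set(range(n))
    for k in range(1, max_len + 1):
        if not unresolved:
            break
        # Count every substring of length k once per occurrence.
        counts = Counter(genome[i:i + k] for i in range(n - k + 1))
        for i in sorted(unresolved):
            if i + k <= n and counts[genome[i:i + k]] == 1:
                mul[i] = k
                unresolved.discard(i)
    return mul


def uniquely_mappable_length(mul: List[Optional[int]], start: int, end: int,
                             read_len: int) -> int:
    """Count start coordinates in [start, end) where a read of length read_len
    fits inside the genome and is unique (its MUL is at most read_len)."""
    last_start = len(mul) - read_len + 1  # one past the last valid read start
    return sum(
        1
        for i in range(start, min(end, last_start))
        if mul[i] is not None and mul[i] <= read_len
    )


def length_adjusted_rpkm(unique_reads: int, mappable_len: int,
                         total_unique_reads: int) -> float:
    """RPKM-style value that uses the uniquely mappable length in place of the
    full gene model length (hypothetical normalization, for illustration)."""
    if mappable_len == 0 or total_unique_reads == 0:
        return 0.0
    return unique_reads * 1e9 / (mappable_len * total_unique_reads)


toy_genome = "ACGTACGA"
toy_mul = minimum_unique_lengths(toy_genome, max_len=len(toy_genome))
print(toy_mul)  # [4, 3, 2, 1, 4, 3, 2, None]
print(uniquely_mappable_length(toy_mul, 0, len(toy_genome), read_len=3))  # 4
```

In the toy genome, the repeated "ACG" forces coordinates 0 and 4 to need four bases before they become unique, while the final coordinate can never be extended far enough and stays unresolved. For 3-base reads, four of the six possible start positions are uniquely mappable, and that count, not the full region length, would be the denominator when multimapping reads are discarded.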

Introduction

The use of next-generation sequencing-based methods has increased enormously in the last couple of years. Common to next-generation sequencing methods is the fragmentation of DNA or RNA into smaller pieces which are amplified, whereupon short reads from millions of these fragments are sequenced in parallel [1]. The length of the sequenced reads typically ranges from around 25 to 150 base pairs for most applications. Mappability can to some extent be improved by performing paired-end sequencing, where two reads, one from each end of each DNA or RNA fragment, are sequenced. In this case a fragment can become uniquely mapped even if one read maps non-uniquely to a repetitive region. Depending upon application, "multimapping" reads are often excluded from analysis since their origin cannot be unambiguously determined.

