Adaptable probabilistic mapping of short reads using position specific scoring matrices

Peter Kerpedjiev, Stinus Lindgreen, Jes Frellsen, Anders Krogh

doi:10.1186/1471-2105-15-100

Abstract

Background: Modern DNA sequencing methods produce vast amounts of data that often require mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach lacks a statistical foundation and can, for some data types, result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases.

Results: We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT-rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans).

Conclusion: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low-quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability-based alignment method on real and simulated data sets.

Highlights

Modern DNA sequencing methods produce vast amounts of data that often require mapping to a reference genome.
A position-specific scoring matrix from quality scores: most sequencing machines provide a quality score for each base, which is related to the probability of a sequencing error occurring at that position in the read.
The algorithm for scoring with position-specific scoring matrices (PSSMs) is based on BWA’s mapping algorithm. This method can be applied to other PSSM applications, lowering the number of genomic locations that need to be evaluated and increasing the efficiency of PSSM searches.
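
As an illustration of how a read's quality scores could be turned into a PSSM, the following is a minimal sketch and not the authors' implementation. It assumes the standard Phred relation (error probability 10^(-Q/10)), spreads the error probability evenly over the three other bases, and uses a uniform background; the paper's method additionally allows models of evolution and data-specific biases, which are omitted here. The function names are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    # Clamp at 0.75 so that a quality of 0 means "all four bases equally likely"
    # instead of giving the observed base probability zero.
    return min(10 ** (-q / 10), 0.75)


def read_to_pssm(read, quals, background=None):
    """Build one log-odds column per read position (hypothetical helper).

    Assumption: a sequencing error is equally likely to produce any of the
    three other bases, and the background is uniform unless one is supplied.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pssm = []
    for base, q in zip(read, quals):
        e = phred_to_error(q)
        column = {}
        for b in BASES:
            p = 1 - e if b == base else e / 3
            column[b] = math.log2(p / background[b])
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the log-odds scores of a genomic segment against the PSSM."""
    return sum(column[b] for column, b in zip(pssm, segment))


if __name__ == "__main__":
    pssm = read_to_pssm("ACGT", [30, 30, 10, 30])
    # A mismatch at the low-quality third position costs less than one
    # at a high-quality position.
    print(score(pssm, "ACGT"), score(pssm, "ACTT"), score(pssm, "ATGT"))
```

Under these assumptions, a mismatch at a low-quality position is penalized less than one at a high-quality position, which is the behavior a plain mismatch count cannot express.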

Summary

Introduction

Modern DNA sequencing methods produce vast amounts of data that often require mapping to a reference genome. Segemehl [25] uses an enhanced suffix array to provide fast alignment of insertion/deletion (indel) prone reads, and a similar approach was implemented in the mapping tool used in the sequencing of the first ancient human genome [13]. Programs such as CUSHAW [26] and SOAP3 [27] have begun to use graphics processing units (GPUs) to provide even faster mapping.

Methods

Results

Conclusion