Efficient Computation of Sequence Mappability

Mai Alzamel, Juliusz Straszyński, Panagiotis Charalampopoulos, Costas S Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P Pissis

doi:10.1007/978-3-030-00479-8_2

Open Access

Abstract

Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices \(j \ne i\) such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of \(k=1\). We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in \(\mathcal {O}(n \min \{m^k,\log ^{k+1} n\})\) time and \(\mathcal {O}(n)\) space for \(k=\mathcal {O}(1)\). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show \(\mathcal {O}(n^2)\)-time algorithms to compute all results for a fixed m and all \(k=0,\ldots ,m\) or a fixed k and all \(m=k,\ldots ,n-1\). Finally, we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for \(k,m = \varTheta (\log n)\) unless the Strong Exponential Time Hypothesis fails.
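To make the problem statement concrete, the following is a minimal brute-force sketch of the (k, m)-mappability table as defined in the abstract. It is only a quadratic-time baseline for checking small inputs, not the algorithm of the paper; the function names `hamming_distance` and `naive_mappability` and the use of 0-based positions are assumptions made for illustration.

```python
from typing import List


def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def naive_mappability(text: str, m: int, k: int) -> List[int]:
    """Brute-force (k, m)-mappability table (hypothetical helper).

    Entry i is the number of positions j != i such that the length-m
    substrings starting at i and j have at most k mismatches.
    Positions are 0-based here; runs in O(n^2 * m) time.
    """
    count = len(text) - m + 1  # number of length-m substrings
    if m <= 0 or count <= 0:
        return []
    table = [0] * count
    for i in range(count):
        for j in range(i + 1, count):
            if hamming_distance(text[i:i + m], text[j:j + m]) <= k:
                # The relation is symmetric, so credit both positions.
                table[i] += 1
                table[j] += 1
    return table


if __name__ == "__main__":
    print(naive_mappability("aabaa", 2, 1))
```

For T = aabaa and m = 2, the substrings are aa, ab, ba, aa, so with k = 0 only the two copies of aa match each other, while with k = 1 every pair except (ab, ba) is within distance 1.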

Highlights

The k-mappability problem. Analyzing data derived from massively parallel sequencing experiments often depends on the process of genome assembly via resequencing; namely, assembly with the help of a reference sequence.
Given a reference sequence, for every substring of length m in the sequence, we want to count how many additional times this substring appears in the sequence when allowing for a small number k of errors.
First we show that a pair of compatible modified substrings implies a pair of length-m substrings at Hamming distance at most k.

Summary

Introduction

The k-mappability problem. Analyzing data derived from massively parallel sequencing experiments often depends on the process of genome assembly via resequencing; namely, assembly with the help of a reference sequence. [...] m}, return all pairs (X1, X2) ∈ R × R, with X1 ≠ X2, such that X1 and X2 are at Hamming distance at most k. This problem has been studied in the average-case model, and efficient linear-time algorithms are known under some constraints on the value of k and some assumptions on the elements of R [11,20,29]. [...] 4, we show an algorithm to solve the all-pairs Hamming distance problem for strings over any ordered alphabet that works in O(r m + r logkr+k 4kk log r + output · 2kk log r ) time and O(r m + r 2kk log r ) space. In comparison to the conference version, in particular, we improve the complexity of the main algorithm by a (log n)-factor, remove the dependency on the alphabet size in contribution 3, and apply our techniques to solve the all-pairs Hamming distance problem (contribution 2).
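The all-pairs Hamming distance problem mentioned above can be illustrated with a naive quadratic sketch. This is not the algorithm of the paper, only a reference point for small inputs. The function name `all_pairs_within_k` is hypothetical, and the sketch assumes that R is given as a list of equal-length strings and that each unordered pair is reported once, as a pair of list indices.

```python
from itertools import combinations
from typing import List, Tuple


def all_pairs_within_k(strings: List[str], k: int) -> List[Tuple[int, int]]:
    """Naive all-pairs Hamming distance (hypothetical helper).

    Returns index pairs (i, j) with i < j such that strings[i] and
    strings[j] differ in at most k positions. Assumes all strings have
    the same length m; runs in O(r^2 * m) time for r strings.
    """
    if len({len(s) for s in strings}) > 1:
        raise ValueError("all strings must have the same length")
    pairs = []
    for (i, x), (j, y) in combinations(enumerate(strings), 2):
        mismatches = sum(a != b for a, b in zip(x, y))
        if mismatches <= k:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    print(all_pairs_within_k(["aaa", "aab", "bbb"], 1))
```

With the three strings aaa, aab, bbb, only the first two are within distance 1; raising k to 2 also admits the pair (aab, bbb).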

Preliminaries

All-Pairs Hamming Distance Problem

We sort all these sets of strings in O(nk m ≤k

Final Remarks