Efficient Computation of Sequence Mappability

Panagiotis Charalampopoulos, Jakub Radoszewski, Costas S. Iliopoulos, Juliusz Straszyński, Tomasz Kociumaka, Solon P. Pissis

doi:10.1007/s00453-022-00934-y

Abstract

Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k = O(1), works in O(n) space and, with high probability, in O(n · min{m^k, log^k n}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n^2)-time algorithms to compute all (k, m)-mappability tables for a fixed m and all k ∈ {0, …, m} or a fixed k and all m ∈ {k, …, n}. Finally, we show that, for k, m = Θ(log n), the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
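To make the definition concrete, the following is a minimal Python sketch of two baselines. It is not the errata-tree algorithm of the paper. The function names (`naive_mappability`, `mappability_all_k`) and the use of 0-based positions are assumptions made for illustration. The first function follows the definition directly in O(n^2 · m) time. The second illustrates one simple way to obtain all tables for a fixed m in O(n^2) time, by sliding a window along each diagonal (fixed shift j - i); it is only a sketch of that idea, not necessarily the authors' method.

```python
def naive_mappability(t: str, m: int, k: int) -> list[int]:
    """Brute-force (k, m)-mappability table of t, with 0-based positions."""
    starts = len(t) - m + 1  # number of length-m substrings
    if starts <= 0:
        return []
    table = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            mismatches = sum(a != b for a, b in zip(t[i:i + m], t[j:j + m]))
            if mismatches <= k:
                # The relation is symmetric, so the pair counts for both i and j.
                table[i] += 1
                table[j] += 1
    return table


def mappability_all_k(t: str, m: int) -> list[list[int]]:
    """Return tables such that tables[k] is the (k, m)-mappability table, k = 0..m."""
    starts = len(t) - m + 1
    if starts <= 0:
        return [[] for _ in range(m + 1)]
    # hist[i][d] = number of positions j != i at Hamming distance exactly d from i.
    hist = [[0] * (m + 1) for _ in range(starts)]
    for shift in range(1, starts):
        # d = mismatches between t[i:i+m] and t[i+shift:i+shift+m], starting at i = 0.
        d = sum(t[p] != t[p + shift] for p in range(m))
        for i in range(starts - shift):
            hist[i][d] += 1
            hist[i + shift][d] += 1
            if i + 1 < starts - shift:
                # Slide the window by one: drop the leftmost column, add the next one.
                d -= (t[i] != t[i + shift])
                d += (t[i + m] != t[i + m + shift])
    # Prefix sums over the exact-distance histogram give the "at most k" counts.
    tables = []
    running = [0] * starts
    for k in range(m + 1):
        running = [running[i] + hist[i][k] for i in range(starts)]
        tables.append(running)
    return tables
```

For instance, with T = "aaab" and m = 2 the substrings are "aa", "aa", "ab", so `naive_mappability("aaab", 2, 0)` returns `[1, 1, 0]` and `naive_mappability("aaab", 2, 1)` returns `[2, 2, 2]`.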

Highlights

The k-mappability problem. Analyzing data derived from massively parallel sequencing experiments often depends on the process of genome assembly via resequencing; namely, assembly with the help of a reference sequence.
Given a reference sequence, for every substring of length m in the sequence, we want to count how many additional times this substring appears in the sequence when allowing for a small number k of errors.
First, we show that a pair of compatible modified substrings implies a pair of length-m substrings at Hamming distance at most k.

Summary

Introduction

The k-mappability problem. Analyzing data derived from massively parallel sequencing experiments often depends on the process of genome assembly via resequencing; namely, assembly with the help of a reference sequence.

[…] m}, return all pairs (X1, X2) ∈ R × R, with X1 ≠ X2, such that X1 and X2 are at Hamming distance at most k. This problem has been studied in the average-case model, and efficient linear-time algorithms are known under some constraints on the value of k and some assumptions on the elements of R [11,20,29].

In Section 4, we show an algorithm to solve the all-pairs Hamming distance problem for strings over any ordered alphabet that works in O(r m + r logkr+k 4kk log r + output · 2kk log r) time and O(r m + r 2kk log r) space.

In comparison to the conference version, in particular, we improve the complexity of the main algorithm by a (log n)-factor, remove the dependency on the alphabet size in contribution 3, and apply our techniques to solve the all-pairs Hamming distance problem (contribution 2).
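As a point of reference for the all-pairs Hamming distance problem, here is a minimal brute-force sketch in Python. It is a quadratic baseline, not the algorithm of Section 4. The names `hamming` and `all_pairs_within` are hypothetical, and the sketch assumes that R is given as a list of equal-length strings and that each unordered pair is reported once, as a pair of list indices.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def all_pairs_within(strings: list[str], k: int) -> list[tuple[int, int]]:
    """Index pairs (i, j), i < j, whose strings are at Hamming distance at most k.

    Brute force: O(r^2 * m) time for r strings of length m.
    """
    return [
        (i, j)
        for (i, x), (j, y) in combinations(enumerate(strings), 2)
        if hamming(x, y) <= k
    ]
```

With the input `["abc", "abd", "xbd", "xyz"]` and k = 1, the call returns `[(0, 1), (1, 2)]`. Reporting indices rather than the strings themselves keeps pairs distinguishable when the input contains repeated strings.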

Preliminaries

All-Pairs Hamming Distance Problem

We sort all these sets of strings in O(nk m ≤k

Final Remarks