A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, … , yk over an alphabet Σ, we are asked to compute the set mathrm {M}^{ell }_{{y_1,ldots ,y_k}} of minimal absent words of length at most ℓ of the collection {y1, … , yk}. The set mathrm {M}^{ell }_{{y_1,ldots ,y_k}} contains all the words x such that x is absent from all the words of the collection while there exist i,j, such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set mathrm {M}^{ell }_{y} of minimal absent words of a word y is equal to mathrm {M}^{ell }_{{y_1,ldots ,y_k}} for any decomposition of y into a collection of words y1, … , yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the plenty available mathcal {O}(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when | mathrm {M}^{ell }_{{y_1,ldots ,y_N}}| =o(n), for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 109 but | mathrm {M}^{12}_{{y_1,ldots ,y_k}}| approx 10^{6}. We consider a constant-sized alphabet for stating our results. We show that allmathrm {M}^{ell }_{y_{1}},ldots ,mathrm {M}^{ell }_{{y_1,ldots ,y_k}} can be computed in mathcal {O}(kn+{sum }^{k}_{N=1}| mathrm {M}^{ell }_{{y_1,ldots ,y_N}}| ) total time using mathcal {O}(textsc {MaxIn}+textsc {MaxOut}) space, where MaxIn is the length of the longest word in {y1, … , yk} and textsc {MaxOut}=max limits {| mathrm {M}^{ell }_{{y_1,ldots ,y_N}}| :Nin [1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
Read full abstract