Compressed indexes for text with wildcards

Chris Thachuk

doi:10.1016/j.tcs.2012.08.011

Open Access

https://doi.org/10.1016/j.tcs.2012.08.011

Journal: Theoretical Computer Science
Publication Date: Aug 24, 2012
Citations: 4
License type: elsevier-specific: oa user license

Affiliation: University of British Columbia

Abstract

We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity while maintaining the query time complexity of previous approaches by giving a compressed index requiring 2nH_k(T) + o(n log σ) + O(n + d log n) bits for a text T of length n over an alphabet of size σ containing d groups of wildcards. The new index is particularly favorable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying O(n) + O(d log(n/d)) bits to also support efficient dictionary matching queries. We discuss how the space can be reduced further by a number of approaches and by allowing an increase in the worst-case query time. We also present a new query algorithm for our wildcard indexes that can greatly reduce the query working space.
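To make the problem statement concrete, the following is a minimal sketch of what a query against a text with wildcards is expected to return. It is a naive quadratic scan, not the compressed index described in the abstract. The `*` placeholder, the function names, and the reading of a "group" as a maximal run of consecutive wildcards are assumptions made for illustration.

```python
from itertools import groupby

WILDCARD = "*"  # assumed placeholder for a wildcard (e.g. SNP) position in the text


def naive_wildcard_matches(text: str, pattern: str) -> list[int]:
    """Return every start position where pattern aligns with text,
    letting each wildcard in the text match any single character."""
    hits = []
    m = len(pattern)
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        if all(t == WILDCARD or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


def count_wildcard_groups(text: str) -> int:
    """Count maximal runs of consecutive wildcards (assumed meaning of d)."""
    return sum(1 for is_wild, _ in groupby(text, key=lambda c: c == WILDCARD) if is_wild)


if __name__ == "__main__":
    text = "ACG*TAC*T"
    print(naive_wildcard_matches(text, "GATA"))  # pattern spans the first wildcard
    print(naive_wildcard_matches(text, "ACGT"))  # wildcard may fall at the pattern's end or middle
    print(count_wildcard_groups(text))
```

The scan takes time proportional to the product of the text and pattern lengths, which is the cost an index is meant to avoid on genome-scale texts.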

Full Text