Pattern-matching and text-compression algorithms

Maxime Crochemore, Thierry Lecroq

doi:10.1145/234313.234331

Abstract

Pattern matching is the problem of locating a specific pattern inside raw data. The pattern is usually a collection of strings described in some formal language. Applications require two kinds of solution, depending on which string, the pattern or the text, is given first. When the pattern is given first, solutions that preprocess it using automata or combinatorial properties of strings are commonly implemented. When the text is given first, solutions rely on indices realized by trees or automata.

The aim of data compression is to represent data in a reduced form in order to save both storage space and transmission time. There is no loss of information: the compression processes are reversible.

Pattern-matching and text-compression algorithms are two important subjects in the wider domain of text processing. They apply to the manipulation of texts (word editors), to the storage of textual data (text compression), and to data retrieval systems (full-text search). They are basic components used in implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems.

Although data are recorded in various ways, text remains the main way to exchange information. This is particularly evident in literature or linguistics, where data are composed of huge corpora and dictionaries, but it applies as well to computer science, where a large amount of data is stored in linear files. It is also the case, for instance, in molecular biology, because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the quantity of available data in these fields tends to double every 18 months. This is the reason that algorithms must be efficient even if the speed and storage capacity of computers increase continuously.
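
The two kinds of solution described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: all function names are hypothetical, the pattern-first search uses a border table in the style of the Morris-Pratt family of algorithms, and a naively built suffix array stands in for the tree- or automaton-based indices the abstract mentions.

```python
def build_border_table(pattern: str) -> list[int]:
    """Preprocess the pattern: border[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    border = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k
    return border


def search_with_preprocessed_pattern(pattern: str, text: str) -> list[int]:
    """Pattern given first: scan the text once, reusing the border table."""
    if not pattern:
        return []
    border = build_border_table(pattern)
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = border[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = border[k - 1]
    return positions


def build_suffix_array(text: str) -> list[int]:
    """Text given first: index it as the sorted list of suffix start positions.
    Naive construction for illustration only; real indices are built faster."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def search_with_index(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Binary-search the index for the block of suffixes starting with pattern."""
    if not pattern:
        return []
    m = len(pattern)

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return text[start:start + m]

    # First rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "abracadabra"
index = build_suffix_array(text)  # built once, reused for many patterns
print(search_with_preprocessed_pattern("abra", text))
print(search_with_index(text, index, "abra"))
```

In the first approach the preprocessing cost depends only on the pattern, which suits searching many texts for one fixed pattern. In the second, the index is built once from the text and then answers queries for many different patterns.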
