Abstract
Summary form only given. Web search engines store web pages in raw text form in order to build so-called snippets (short text fragments surrounding the searched pattern) or to evaluate so-called positional ranking functions. We address the problem of compressing a large collection of text files distributed over a cluster of computers, where individual files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is based on the word-based approach and on the idea of combining two statistical models: a global model (common to all files of the set) and a local model. The latter is built as the set of changes that transform the global model into the proper model of the single compressed file. Besides a very good compression ratio, the compression method allows fast searching on the compressed text, which is an attractive property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in prior work. Our algorithm SF-STBDC outperforms the algorithm based on (s,c)-Dense Code in compression ratio and at the same time keeps very good searching and decompression speed. The key idea behind this result is the use of Semi-Adaptive Two Byte Dense Code, which provides more effective coding of small portions of the text while still allowing exact setting of the number of stoppers and continuers.
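The stopper/continuer idea behind two-byte dense coding can be illustrated with a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: it assumes the 256 byte values are split into s stopper values and 256 - s continuer values, and encodes a word's frequency rank as a one- or two-byte codeword; the function names and the parameter s are illustrative assumptions.

def encode_rank(rank: int, s: int) -> bytes:
    """Encode a word's frequency rank (0 = most frequent) as 1 or 2 bytes.

    Byte values 0..s-1 are stoppers (they end a codeword); values s..255
    are continuers (they signal that one more byte follows).
    """
    c = 256 - s                       # number of continuer byte values
    if rank < s:                      # most frequent words: a single stopper byte
        return bytes([rank])
    rank -= s
    if rank < c * s:                  # next c*s words: continuer byte + stopper byte
        return bytes([s + rank // s, rank % s])
    raise ValueError("rank too large for a two-byte dense code")

def decode(data: bytes, s: int) -> list:
    """Decode a byte stream back into the sequence of word ranks."""
    ranks, i = [], 0
    while i < len(data):
        b = data[i]
        if b < s:                     # stopper: one-byte codeword
            ranks.append(b)
            i += 1
        else:                         # continuer: two-byte codeword follows
            ranks.append(s + (b - s) * s + data[i + 1])
            i += 2
    return ranks

With this split, the s most frequent words receive one-byte codes and the next (256 - s) * s words receive two-byte codes, so choosing s trades one-byte coverage against total vocabulary capacity; this tunable split is what the abstract refers to as the exact setting of the number of stoppers and continuers.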