Compression of indexes with full positional information in very large text databases

Gordon Linoff, Craig Stanfill

doi:10.1145/160688.160699

Publication Date: Jan 1, 1993

Citations: 18

#Large Text Databases #Run-Length Encoding

Abstract

This paper describes a combination of compression methods that can be used to reduce the size of inverted indexes for very large text databases. These methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. When these methods are applied to two different text sources (the King James Version of the Bible and a sample of Wall Street Journal stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The decreased time required for I/O can almost fully compensate for the time needed to decompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.
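
The abstract names its techniques without defining them, so the following is only a generic sketch of what prefix omission and run-length encoding commonly mean for positional postings, not the authors' implementation. The `(document id, word position)` posting layout and all function names are assumptions made for illustration. n-s coding is left out because the abstract does not say how it works.

```python
from itertools import groupby

# Assumed posting layout: (document id, word position), sorted.

def omit_prefix(postings):
    """Store each document id once, followed by its positions,
    instead of repeating the id in every posting."""
    return [(doc, [pos for _, pos in group])
            for doc, group in groupby(postings, key=lambda p: p[0])]

def restore_prefix(grouped):
    """Inverse of omit_prefix: rebuild the full (doc, pos) pairs."""
    return [(doc, pos) for doc, positions in grouped for pos in positions]

def run_length_encode(values):
    """Collapse runs of equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return [value for value, count in runs for _ in range(count)]

postings = [(1, 4), (1, 9), (1, 17), (3, 2), (3, 5)]
grouped = omit_prefix(postings)
doc_runs = run_length_encode([doc for doc, _ in postings])
```

Both transforms are lossless: `restore_prefix(grouped)` and `run_length_decode(doc_runs)` return the original postings and document-id sequence. In a real index the remaining integers would then be written in a compact numeric code, which is the role the abstract assigns to n-s coding.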
