In-memory hash tables for accumulating text vocabularies

Justin Zobel, Steffen Heinz, Hugh E Williams

doi:10.1016/s0020-0190(01)00239-3

Abstract

Searching large text collections, such as repositories of Web pages, is today one of the most common uses of computers. A collection must be indexed before it can be searched. One of the main tasks in constructing an index is identifying the set of unique words occurring in the collection, that is, extracting its vocabulary. This vocabulary is used during index construction to accumulate statistics and temporary inverted lists, and at query time both for fetching inverted lists and as a source of information about the repository. In English text, where the frequency of occurrence of words is skewed and follows the Zipf distribution [8], the vocabulary is typically small enough to fit in main memory. For example, a medium-size collection of around 1 GB of English text derived from the TREC world-wide web data [2] contains around 170 million word occurrences, of which just under 2 million are distinct words. The single most frequent word, “the”, occurs almost 6.5 million times, almost twice as often as the second most frequent word, “of”.
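The vocabulary accumulation described in the abstract can be sketched with a small chained hash table that counts word occurrences. This is only an illustration under stated assumptions: the class name VocabTable, the helper accumulate, the table size, the string hash, and the letters-only tokenizer are all hypothetical choices made for this sketch and are not taken from the paper.

```python
import re


class VocabTable:
    """Fixed-size chained hash table mapping word -> occurrence count.

    Illustrative only: the slot count and hash function are assumptions.
    """

    def __init__(self, num_slots=1024):
        self.slots = [[] for _ in range(num_slots)]
        self.distinct = 0      # number of unique words seen
        self.occurrences = 0   # total number of words seen

    def _slot(self, word):
        # Simple deterministic multiplicative string hash (assumed, not the paper's).
        h = 0
        for ch in word:
            h = (h * 31 + ord(ch)) & 0xFFFFFFFF
        return h % len(self.slots)

    def add(self, word):
        """Record one occurrence of word, inserting it if it is new."""
        chain = self.slots[self._slot(word)]
        self.occurrences += 1
        for entry in chain:
            if entry[0] == word:
                entry[1] += 1
                return
        chain.append([word, 1])
        self.distinct += 1

    def count(self, word):
        """Return how often word has been seen (0 if never)."""
        for stored, n in self.slots[self._slot(word)]:
            if stored == word:
                return n
        return 0


def accumulate(text, table=None):
    """Add every lowercase alphabetic token in text to the table."""
    if table is None:
        table = VocabTable()
    for word in re.findall(r"[a-z]+", text.lower()):
        table.add(word)
    return table
```

Calling accumulate on a document and then reading table.occurrences and table.distinct gives the two quantities the abstract contrasts: total word occurrences and vocabulary size. Because word frequencies are heavily skewed, most calls to add hit a word that is already stored, so lookup cost tends to dominate insertion cost.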

Journal: Information Processing Letters
Publication Date: Oct 17, 2001
Citations: 66

Similar Papers

Web-based textual analysis of free-text patient experience comments from a survey in primary care.
Inocencio Daniel Maramba ... Martin Roberts
JMIR Medical Informatics | VOL. 3
06 May 2015

Searchable words on the Web
Hugh E Williams ... Justin Zobel
International Journal on Digital Libraries | VOL. 5
01 Apr 2005

Frequency in Incidental Vocabulary Acquisition Research: An Undefined Concept and Some Consequences
Barry Lee Reynolds ... David Wible
TESOL Quarterly | VOL. 48
28 Oct 2014

Utilizing distinct terms for proximity and phrases in the document for better information retrieval
Muhammad Imran Rafique ... Mehdi Hassan
01 Dec 2014
