Abstract

Searching of large text collections, such as repositories of Web pages, is today one of the commonest uses of computers. For a collection to be searched, it requires an index. One of the main tasks in constructing an index is identifying the set of unique words occurring in the collection, that is, extracting its vocabulary. This vocabulary is used during index construction to accumulate statistics and temporary inverted lists, and at query time both for fetching inverted lists and as a source of information about the repository. In the case of English text, where frequency of occurrence of words is skewed and follows the Zipf distribution [8], vocabulary size is typically smaller than main memory. As an example, in a medium-size collection of around 1 GB of English text derived from the TREC world-wide web data [2], there are around 170 million word occurrences, of which just under 2 million are distinct words. The single most frequent word, “the”, occurs almost 6.5 million times — almost twice as often as the second most frequent word, “of”

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.