Abstract
Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and their associated satellite values without compromising retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, take no less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity.
Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. In an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.
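The context-based remapping described in the abstract can be illustrated with a short sketch (the helper names below are hypothetical, not the paper's actual code): each word is re-encoded with a small identifier that is local to its context, so the integers to compress grow only with the number of distinct followers of that context rather than with the vocabulary size.

```python
from collections import defaultdict

def build_context_maps(ngrams, k=1):
    """For each length-k context, assign each following word a small
    local identifier in [0, number of distinct followers).
    Hypothetical helper for illustration only."""
    followers = defaultdict(dict)
    for gram in ngrams:
        context, word = tuple(gram[-k - 1:-1]), gram[-1]
        if word not in followers[context]:
            followers[context][word] = len(followers[context])
    return followers

def remap(ngram, maps, k=1):
    """Replace the last word of the n-gram by its context-local id."""
    context, word = tuple(ngram[-k - 1:-1]), ngram[-1]
    return maps[context][word]

ngrams = [("the", "cat"), ("the", "dog"), ("a", "cat")]
maps = build_context_maps(ngrams, k=1)
# "cat" after "the" -> 0, "dog" after "the" -> 1, "cat" after "a" -> 0
```

Because natural-language contexts typically admit few distinct followers, the remapped identifiers are small and compress well.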
Highlights
We present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context
The other related problem that we study in this paper is the one of computing the probability distribution of the n-grams extracted from large textual collections
(1) We introduce a compressed trie data structure in which each level of the trie is modeled as a monotone integer sequence that we encode with Elias-Fano [17, 18] so as to efficiently support random access operations and successor queries over the compressed sequence
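As a rough illustration of the encoding named in highlight (1), the sketch below stores a non-decreasing integer sequence in Elias-Fano form and answers access and successor (NextGEQ) queries. This is a didactic sketch: real implementations answer both operations in constant or near-constant time using rank/select structures over the high-bits bitvector, whereas this version scans linearly.

```python
import math

class EliasFano:
    """Sketch of Elias-Fano coding for a non-decreasing sequence of n
    integers drawn from [0, universe): the l = floor(log2(universe/n))
    low bits of each value are stored verbatim, while the high parts
    are written in unary into a bitvector with one 0 per bucket."""

    def __init__(self, values, universe):
        n = len(values)
        self.l = max(0, int(math.floor(math.log2(universe / n))))
        mask = (1 << self.l) - 1
        self.low = [v & mask for v in values]
        buckets = [0] * ((universe >> self.l) + 1)
        for v in values:
            buckets[v >> self.l] += 1
        self.high = []
        for count in buckets:
            self.high.extend([1] * count + [0])

    def access(self, i):
        """Return the i-th value: its high part equals the number of
        0-bits preceding the i-th 1-bit in the unary bitvector."""
        ones, zeros = -1, 0
        for bit in self.high:
            if bit:
                ones += 1
                if ones == i:
                    return (zeros << self.l) | self.low[i]
            else:
                zeros += 1
        raise IndexError(i)

    def successor(self, x):
        """Smallest stored value >= x (NextGEQ); a linear scan here,
        constant time with select support in real implementations."""
        for i in range(len(self.low)):
            v = self.access(i)
            if v >= x:
                return v
        return None

ef = EliasFano([3, 4, 7, 13], universe=16)
```

For the sequence above, access(3) recovers 13 and successor(5) returns 7, the smallest stored value at least 5.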
Summary
The objective is to predict the query by saving keystrokes: this is implemented by reporting the top-k most frequently-searched n-grams that follow the keywords typed by the user [2, 34, 35]. Given the number of users served by large-scale search engines and the high query rates, it is of utmost importance that such data structure traversals are carried out in a handful of microseconds [2, 14, 28, 34, 35]. Another notable example is spelling correction in text editors and web search. The primary goal of a language model is to compute the probability of the word w_n given its preceding history of n − 1 words, called the context, that is: compute P(w_n | w_1^{n−1}) for all w_1^n ∈ S.
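In the simplest setting, both operations above reduce to look-ups over n-gram counts. The sketch below uses plain maximum-likelihood count ratios for illustration, not the modified Kneser-Ney smoothing the paper actually estimates, and all function names are hypothetical.

```python
from collections import Counter

def conditional_probability(counts, ngram):
    """MLE estimate P(w_n | w_1..w_{n-1}) = c(w_1..w_n) / c(w_1..w_{n-1}),
    computed from a dict mapping n-gram tuples to occurrence counts.
    Illustrative only: no smoothing is applied."""
    context = ngram[:-1]
    denom = sum(c for g, c in counts.items() if g[:-1] == context)
    return counts.get(ngram, 0) / denom if denom else 0.0

def top_k_completions(counts, prefix, k):
    """Return the k most frequent words following the typed prefix,
    as in the query auto-completion scenario."""
    followers = Counter()
    for gram, c in counts.items():
        if gram[:-1] == prefix:
            followers[gram[-1]] += c
    return [w for w, _ in followers.most_common(k)]

counts = {("the", "cat"): 3, ("the", "dog"): 5, ("a", "cat"): 2}
```

With these counts, P(dog | the) = 5/8, and the top-1 completion of "the" is "dog". Compressed trie indexes make exactly these context look-ups fast at scale.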