Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri, Rossano Venturini

doi:10.1145/3302913

Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor but also retains an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.
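
The context-based encoding described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `build_successors` and `remap_last_word`, the use of integer word IDs, and the choice to collect followers from every position of every n-gram are assumptions made for illustration. The idea shown is that a word is replaced by its rank among the distinct words that follow its k-word context, so the stored integer is bounded by the number of such followers rather than by the vocabulary size.

```python
from bisect import bisect_left
from collections import defaultdict


def build_successors(ngrams, k):
    """Map every k-word context to the sorted list of word IDs that follow it.

    `ngrams` is an iterable of tuples of integer word IDs (hypothetical input format).
    """
    followers = defaultdict(set)
    for gram in ngrams:
        for i in range(k, len(gram)):
            followers[gram[i - k:i]].add(gram[i])
    return {ctx: sorted(words) for ctx, words in followers.items()}


def remap_last_word(gram, successors, k):
    """Return the rank of the last word of `gram` among the followers of its context.

    Requires len(gram) > k and a context that was seen by build_successors.
    """
    context, word = gram[-k - 1:-1], gram[-1]
    return bisect_left(successors[context], word)


# Toy data: word IDs can be large, but each context has few followers.
ngrams = [(3, 7, 120), (3, 7, 4500), (9, 7, 98000), (3, 8, 120)]
successors = build_successors(ngrams, k=1)
remapped = [remap_last_word(g, successors, k=1) for g in ngrams]
print(remapped)
```

In this toy input the word ID 98000 is stored as a rank smaller than 3, because only three distinct words follow the context word 7.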

Highlights

We present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context
The other related problem that we study in this paper is the one of computing the probability distribution of the n-grams extracted from large textual collections
We introduce a compressed trie data structure in which each level of the trie is modeled as a monotone integer sequence that we encode with Elias-Fano [17, 18] so as to efficiently support random access operations and successor queries over the compressed sequence
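
As a rough illustration of the Elias-Fano encoding mentioned in the last highlight, the sketch below splits each value of a non-decreasing sequence into low bits, stored verbatim, and high bits, stored in unary in a bit vector. The class name `EliasFano` and its methods are hypothetical, a Python integer stands in for the bit vector, and select is done by a linear scan where a real implementation would use a succinct select structure. It is meant to show the layout and the two operations (random access and successor), not the performance.

```python
class EliasFano:
    """Toy Elias-Fano encoding of a non-decreasing list of non-negative integers."""

    def __init__(self, values, universe=None):
        if any(a > b for a, b in zip(values, values[1:])):
            raise ValueError("values must be non-decreasing")
        self.n = len(values)
        u = universe if universe is not None else (values[-1] + 1 if values else 1)
        # Number of low bits per element: floor(log2(u / n)), at least 0.
        self.low_bits = max(0, (u // self.n).bit_length() - 1) if self.n else 0
        mask = (1 << self.low_bits) - 1
        self.lows = [v & mask for v in values]
        # High parts in unary: element i sets the bit at position high_i + i.
        self.high = 0
        for i, v in enumerate(values):
            self.high |= 1 << ((v >> self.low_bits) + i)

    def _select1(self, i):
        # Position of the i-th (0-based) set bit, found by a linear scan.
        bits, pos, seen = self.high, 0, -1
        while bits:
            if bits & 1:
                seen += 1
                if seen == i:
                    return pos
            bits >>= 1
            pos += 1
        raise IndexError(i)

    def access(self, i):
        """Return the i-th value of the encoded sequence."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        high_part = self._select1(i) - i
        return (high_part << self.low_bits) | self.lows[i]

    def successor(self, x):
        """Return the smallest encoded value >= x, or None if there is none."""
        lo, hi = 0, self.n
        while lo < hi:  # binary search over random access
            mid = (lo + hi) // 2
            if self.access(mid) < x:
                lo = mid + 1
            else:
                hi = mid
        return self.access(lo) if lo < self.n else None


ef = EliasFano([3, 4, 7, 13, 14, 15, 21, 43])
print(ef.access(3), ef.successor(16))
```

The successor query here is a plain binary search over `access`; production implementations typically narrow the search using the high-bits vector first.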

Summary

Introduction

The objective is to predict the query by saving keystrokes: this is implemented by reporting the top-k most frequently-searched n-grams that follow the keywords typed by the user [2, 34, 35]. Given the number of users served by large-scale search engines and the high query rates, it is of utmost importance that such data structure traversals are carried out in a handful of microseconds [2, 14, 28, 34, 35]. Another noticeable example is spelling correction in text editors and web search. The primary goal of a language model is to compute the probability of the word $w_n$ given its preceding history of $n - 1$ words, called the context, that is: compute $P(w_n \mid w_1^{n-1})$ for all $w_1^n \in S$.
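
To make the conditional probability of a word given its context concrete, the sketch below computes an unsmoothed maximum-likelihood estimate from raw counts. This is deliberately simpler than the modified Kneser-Ney estimator the paper addresses, and the function names `count_ngrams` and `mle_probability` are hypothetical; it only shows what quantity is being estimated.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_probability(tokens, context, word):
    """Unsmoothed estimate of P(word | context): count(context + word) / count(context followed by anything)."""
    context = tuple(context)
    grams = count_ngrams(tokens, len(context) + 1)
    context_total = sum(c for g, c in grams.items() if g[:-1] == context)
    if context_total == 0:
        return 0.0  # unseen context; a smoothed model would back off instead
    return grams[context + (word,)] / context_total


tokens = "the cat sat on the mat the cat ran".split()
print(mle_probability(tokens, ("the",), "cat"))
```

An estimate like this assigns zero probability to any n-gram absent from the training text, which is the limitation that smoothing methods such as Kneser-Ney are designed to overcome.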

Journal: ACM Transactions on Information Systems
Publication Date: Feb 11, 2019
Citations: 24
License type: CC BY
