Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees

Ehsan Shareghi, Trevor Cohn, Gholamreza Haffari, Matthias Petri

doi:10.18653/v1/d15-1288

Open Access

Publication Date: Jan 1, 2015
Citations: 33	License type: cc-by

Abstract

Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index, a compressed suffix tree, which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and, although slower than leading LM toolkits, shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.
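
As a rough illustration of what on-the-fly counting over a text index can look like, the sketch below computes two statistics that Kneser-Ney smoothing is built from: the occurrence count of a pattern and the number of distinct tokens that precede it. It is not the authors' implementation. It assumes a plain, uncompressed suffix array in place of the compressed suffix tree, a naive construction suitable only for toy input, and hypothetical function names (`build_suffix_array`, `sa_range`, `count`, `distinct_left`).

```python
def build_suffix_array(tokens):
    # Naive construction (assumption: toy corpus only); sorts suffix start
    # positions by the token sequence that begins at each position.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def sa_range(tokens, sa, pattern):
    """Return [start, end) of suffix-array slots whose suffix begins with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first slot whose m-token prefix is >= pattern
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first slot whose m-token prefix is > pattern
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(tokens, sa, pattern):
    # Number of occurrences of pattern in the corpus.
    start, end = sa_range(tokens, sa, pattern)
    return end - start


def distinct_left(tokens, sa, pattern):
    # Number of distinct tokens that immediately precede pattern.
    start, end = sa_range(tokens, sa, pattern)
    return len({tokens[sa[k] - 1] for k in range(start, end) if sa[k] > 0})


# Toy corpus (made up for illustration).
toy_tokens = "a b a b c a b".split()
toy_sa = build_suffix_array(toy_tokens)
ab_count = count(toy_tokens, toy_sa, ["a", "b"])        # 3 occurrences of "a b"
a_left = distinct_left(toy_tokens, toy_sa, ["a"])       # "a" is preceded by b and c
```

Because both statistics are derived from a single range in the index, no n-gram table has to be precomputed for any fixed order; pattern length is bounded only by the corpus.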

Full Text