Homomorphic Compression: Making Text Processing on Compression Unlimited

Kuangyu Chen,Anqun Pan,Yihua Hu,Feng Zhang,Yuxing Chen,Xiaoyong Du,Siqi Ma,Jiawei Guan

doi:10.1145/3626765

Abstract

Lossless data compression is an effective way to handle the huge transmission and storage overhead of massive text data. Its utility is even more significant today when data volumes are skyrocketing. The concept of operating on compressed data infuses new blood into efficient text management by enabling mainly access-oriented text processing tasks to be done directly on compressed data without decompression. Facing limitations of the existing compressed text processing schemes such as limited types of operations supported, low efficiency, and high space occupation, we address these problems by proposing a homomorphic compression theory. It enables the generalization and characterization of algorithms with compression processing capabilities. On this basis, we develop HOCO, an efficient text data management engine that supports a variety of processing tasks on compressed text. We select three representative compression schemes and implement them combined with homomorphism in HOCO. HOCO supports the extension of homomorphic compression schemes through a modular and object-oriented design and has convenient interfaces for text processing tasks. We evaluate HOCO on six real-world datasets. The three schemes implemented in HOCO show trade-offs in terms of compression ratio, supported operation types, and efficiency. Experiments also show that HOCO can achieve higher throughput in random access and modification operations (averagely 9.18× than the state-of-the-art) and lower latency in text analytic tasks (averagely 7.16× than processing on uncompressed text) without compromising compression efficacy.

Full Text