Abstract
The large memory consumption of neural network language models (NN LMs) prohibits their use in many resource-constrained scenarios. Hence, effective NN LM compression approaches that are independent of the network structure are of great interest. However, previous approaches usually achieve a high compression ratio at the cost of a significant performance loss. In this paper, two recently proposed quantization approaches, product quantization (PQ) and soft binarization, are effectively combined to address this issue. PQ decomposes word embedding matrices into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately. Soft binarization uses a small number of float scalars and the knowledge distillation technique to recover the performance loss incurred during binarization. Experiments show that the proposed approach achieves high compression ratios, from 70 to over 100, while maintaining performance comparable to the uncompressed NN LM on both perplexity (PPL) and word error rate criteria.
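To make the PQ step concrete, the following is a minimal sketch of product quantization applied to a word embedding matrix: the embedding dimension is split into disjoint subspaces, each subspace is clustered independently, and every word is stored as one codebook index per subspace. This is an illustrative assumption-based example (the function names, k-means codebook learning via scikit-learn, and the chosen subspace/codebook sizes are not taken from the paper), and it does not cover the soft binarization or knowledge distillation components.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(embeddings, num_subspaces=8, codebook_size=256, seed=0):
    """Compress an embedding matrix with product quantization (PQ).

    The D-dimensional embedding space is split into `num_subspaces`
    disjoint sub-vectors; each subspace is clustered independently,
    and every word is represented by one codebook index per subspace.
    """
    vocab, dim = embeddings.shape
    assert dim % num_subspaces == 0, "dim must be divisible by num_subspaces"
    sub_dim = dim // num_subspaces

    codebooks = np.empty((num_subspaces, codebook_size, sub_dim), dtype=embeddings.dtype)
    codes = np.empty((vocab, num_subspaces),
                     dtype=np.uint8 if codebook_size <= 256 else np.uint16)

    for m in range(num_subspaces):
        # Quantize the m-th subspace with its own k-means codebook.
        sub = embeddings[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(sub)
        codebooks[m] = km.cluster_centers_
        codes[:, m] = km.labels_
    return codebooks, codes

def reconstruct(codebooks, codes):
    """Rebuild an approximate embedding matrix from PQ codebooks and codes."""
    num_subspaces = codebooks.shape[0]
    parts = [codebooks[m][codes[:, m]] for m in range(num_subspaces)]
    return np.concatenate(parts, axis=1)
```

With, for example, a 10k-word vocabulary, 512-dimensional embeddings, 8 subspaces, and 256 centroids per subspace, the original float matrix is replaced by 8 bytes of codes per word plus small shared codebooks, which is where most of the compression in the embedding layers comes from.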