Abstract
The large memory consumption of neural network language models (NN LMs) prohibits their use in many resource-constrained scenarios. Hence, effective NN LM compression approaches that are independent of the network structure are of great interest. However, previous approaches usually achieve a high compression ratio at the cost of a significant performance loss. In this paper, two recently proposed quantization approaches, product quantization (PQ) and soft binarization, are effectively combined to address this issue. PQ decomposes word embedding matrices into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately. Soft binarization uses a small number of float scalars and the knowledge distillation technique to recover the performance loss incurred during binarization. Experiments show that the proposed approach achieves high compression ratios, from 70 to over 100, while maintaining performance comparable to the uncompressed NN LM on both perplexity (PPL) and word error rate criteria.
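To make the PQ step concrete, the following is a minimal sketch of product quantization applied to a word embedding matrix: the embedding dimension is split into disjoint subspaces, each subspace is clustered independently, and every word is stored as one codebook index per subspace. This is an illustrative assumption-based example (the function names, k-means codebook learning via scikit-learn, and the chosen subspace/codebook sizes are not taken from the paper), and it does not cover the soft binarization or knowledge distillation components.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(embeddings, num_subspaces=8, codebook_size=256, seed=0):
    """Compress an embedding matrix with product quantization (PQ).

    The D-dimensional embedding space is split into `num_subspaces`
    disjoint sub-vectors; each subspace is clustered independently,
    and every word is represented by one codebook index per subspace.
    """
    vocab, dim = embeddings.shape
    assert dim % num_subspaces == 0, "dim must be divisible by num_subspaces"
    sub_dim = dim // num_subspaces

    codebooks = np.empty((num_subspaces, codebook_size, sub_dim), dtype=embeddings.dtype)
    codes = np.empty((vocab, num_subspaces),
                     dtype=np.uint8 if codebook_size <= 256 else np.uint16)

    for m in range(num_subspaces):
        # Quantize the m-th subspace with its own k-means codebook.
        sub = embeddings[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(sub)
        codebooks[m] = km.cluster_centers_
        codes[:, m] = km.labels_
    return codebooks, codes

def reconstruct(codebooks, codes):
    """Rebuild an approximate embedding matrix from PQ codebooks and codes."""
    num_subspaces = codebooks.shape[0]
    parts = [codebooks[m][codes[:, m]] for m in range(num_subspaces)]
    return np.concatenate(parts, axis=1)
```

With, for example, a 10k-word vocabulary, 512-dimensional embeddings, 8 subspaces, and 256 centroids per subspace, the original float matrix is replaced by 8 bytes of codes per word plus small shared codebooks, which is where most of the compression in the embedding layers comes from.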