Abstract

In this paper, we develop a novel Joint Model Compression (referred to as JMC) method that combines structured pruning and dense knowledge distillation to compress a large original language model into a deeply compressed shallow network. In particular, a new Direct Importance-aware Structured Pruning (referred to as DISP) approach is proposed to structurally prune redundant structures in Transformer networks directly based on the corresponding parameter matrices of the model. In addition, a Dense Knowledge Distillation (referred to as DKD) method is developed with a many-to-one layer mapping strategy that leverages more comprehensive layer-wise linguistic knowledge for distillation. Furthermore, the proposed structured pruning and dense knowledge distillation are integrated to perform joint compression, which enables significant compression without sacrificing model accuracy. Extensive experimental results on seven datasets across four NLP tasks demonstrate the effectiveness of JMC and its superiority over the baselines: the compressed model maintains performance comparable to the original large model while offering remarkable benefits in inference-time speedup and memory efficiency.
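
To make the pruning idea concrete, the following is a minimal sketch of importance-aware structured pruning of attention heads, where each head is scored from the norms of its slices of the Q/K/V/O parameter matrices. The scoring function and the keep ratio are illustrative assumptions, not the paper's exact DISP criterion.

```python
# Illustrative sketch only: DISP scores Transformer structures (here, attention
# heads) directly from their parameter matrices. The per-head L2-norm score and
# keep_ratio below are assumptions for illustration.
import torch

def head_importance(w_q, w_k, w_v, w_o, num_heads):
    """Score each attention head by the L2 norm of its parameter slices."""
    d_model = w_q.shape[0]
    d_head = d_model // num_heads
    scores = []
    for h in range(num_heads):
        rows = slice(h * d_head, (h + 1) * d_head)
        # Q/K/V rows produce this head's subspace; the output projection
        # consumes it along its columns.
        norm = (w_q[rows].norm() + w_k[rows].norm()
                + w_v[rows].norm() + w_o[:, rows].norm())
        scores.append(norm.item())
    return torch.tensor(scores)

def heads_to_prune(scores, keep_ratio=0.5):
    """Keep the highest-scoring heads; return indices of heads to remove."""
    num_keep = max(1, int(len(scores) * keep_ratio))
    order = torch.argsort(scores, descending=True)
    return sorted(order[num_keep:].tolist())

# Toy example: one 12-head layer with d_model = 768 and random weights.
d_model, num_heads = 768, 12
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) for _ in range(4))
scores = head_importance(w_q, w_k, w_v, w_o, num_heads)
print("heads to prune:", heads_to_prune(scores, keep_ratio=0.5))
```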
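The many-to-one layer mapping in DKD can likewise be sketched as grouping several consecutive teacher layers onto each student layer and matching hidden states. The consecutive grouping and the MSE objective are assumptions used for illustration; the paper defines the actual mapping and loss.

```python
# Illustrative sketch only: dense distillation with a many-to-one layer mapping.
# The consecutive grouping and plain MSE matching are assumptions, not the
# paper's exact DKD formulation.
import torch
import torch.nn.functional as F

def many_to_one_mapping(num_teacher_layers, num_student_layers):
    """Group consecutive teacher layers onto each student layer."""
    per_student = num_teacher_layers // num_student_layers
    return {s: list(range(s * per_student, (s + 1) * per_student))
            for s in range(num_student_layers)}

def dense_distill_loss(student_hidden, teacher_hidden, mapping):
    """Sum MSE between each student layer and all teacher layers mapped to it."""
    loss = torch.tensor(0.0)
    for s_idx, t_indices in mapping.items():
        for t_idx in t_indices:
            loss = loss + F.mse_loss(student_hidden[s_idx], teacher_hidden[t_idx])
    return loss

# Toy example: 12-layer teacher, 4-layer student, hidden size 768.
batch, seq, hidden = 2, 16, 768
teacher_hidden = [torch.randn(batch, seq, hidden) for _ in range(12)]
student_hidden = [torch.randn(batch, seq, hidden, requires_grad=True)
                  for _ in range(4)]
mapping = many_to_one_mapping(12, 4)   # {0: [0, 1, 2], 1: [3, 4, 5], ...}
print(dense_distill_loss(student_hidden, teacher_hidden, mapping))
```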
