Abstract

Transformers and their variants achieve excellent results in various computer vision and natural language processing tasks, but high computational costs and reliance on large training datasets restrict their deployment in resource-constrained settings. Low-rank approximation of model weights has been effective in compressing CNN models, but its application to transformers has been less explored and is less effective. Existing methods require the complete dataset to fine-tune compressed models, which is both time-consuming and data-hungry. This paper reveals that the features (i.e., activations) are low-rank, but the model weights are surprisingly not low-rank. Hence, AAFM (Adaptive Atomic Feature Mimicking) is proposed, which adaptively determines the compressed model structure and locally compresses each linear layer's output features rather than the model weights. A second stage, GFM (Global Feature Mimicking), optimizes the entire compressed network holistically. Both AAFM and GFM use only a few training samples without labels; that is, they are few-shot, unsupervised, fast, and effective. For example, with only 2K unlabeled images, 33% of the parameters in DeiT-B are removed with an 18.8% relative throughput increase, at only a 0.23% accuracy loss for ImageNet recognition. The proposed methods are also successfully applied to the language modeling task in NLP. Besides, the few-shot compressed models generalize well to downstream tasks.
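The central mechanism, compressing a linear layer by approximating its output features (rather than its weights) with a low-rank subspace estimated from a few unlabeled samples, can be sketched concretely. The PyTorch code below is a minimal illustration of generic feature-level low-rank factorization under this idea, not the authors' AAFM implementation; the function name, the fixed rank `k`, and the calibration tensor are assumptions made for the example.

```python
# Minimal sketch (assumptions noted above): factor one nn.Linear by applying
# PCA to its *output features* computed on a small unlabeled calibration set,
# then fold the projection into two thin linear layers.
import torch
import torch.nn as nn

@torch.no_grad()
def compress_linear_by_features(layer: nn.Linear,
                                calib_inputs: torch.Tensor,
                                k: int) -> nn.Sequential:
    # Collect output activations Y on the calibration samples.
    y = layer(calib_inputs)                      # (N, d_out)
    mu = y.mean(dim=0)                           # feature mean, (d_out,)
    # PCA of the centered features: top-k eigenvectors of the covariance.
    cov = (y - mu).T @ (y - mu) / y.shape[0]     # (d_out, d_out)
    _, eigvecs = torch.linalg.eigh(cov)          # eigenvalues ascending
    u = eigvecs[:, -k:]                          # (d_out, k), top-k directions
    # Two thin layers implementing  new(x) = U U^T (W x + b - mu) + mu,
    # which approximates  W x + b  when the features are (near) rank-k.
    first = nn.Linear(layer.in_features, k)
    first.weight.copy_(u.T @ layer.weight)       # (k, d_in)
    first.bias.copy_(u.T @ (layer.bias - mu))    # (k,)
    second = nn.Linear(k, layer.out_features)
    second.weight.copy_(u)                       # (d_out, k)
    second.bias.copy_(mu)                        # (d_out,)
    return nn.Sequential(first, second)
```

Replacing the original layer with the returned `nn.Sequential` saves parameters whenever k(d_in + d_out) < d_in · d_out. Choosing the rank per layer adaptively (the "adaptive" part of AAFM) and the subsequent holistic GFM optimization of the whole compressed network are beyond this sketch.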
