Abstract

Transfer learning from large-scale language models has seen rapid growth in natural language processing (NLP). However, operating these large models requires substantial computational power and training effort, which makes many applications built on them impractical for industrial products: deploying such models on power-constrained devices, such as mobile phones, is extremely challenging. Model compression, i.e., transforming deep and large networks into shallow and small ones, has therefore become a popular research direction in the NLP community, and many techniques are now available, such as weight pruning and knowledge distillation. The primary concern regarding these techniques is how much of the language understanding capability is retained by the compressed models in a particular domain. In this paper, we conduct a comparative analysis of several popular large-scale language models, including BERT, RoBERTa, and XLNet-Large, and their compressed variants, e.g., DistilBERT and DistilRoBERTa, and evaluate their performance on three datasets in the social media domain. Experimental results demonstrate that the compressed language models, though they consume far fewer computational resources, achieve approximately the same level of language understanding as the large-scale language models in the social media domain.
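As a minimal sketch of the kind of comparison described above (not the authors' actual experimental code), the snippet below loads a large model and its distilled counterpart with the Hugging Face transformers library and reports their parameter counts alongside raw classification logits. The model names, the binary-label setup, and the example input are illustrative assumptions; in practice each model would be fine-tuned on the social media datasets before evaluation.

```python
# Illustrative sketch only: compares a large model with its distilled variant.
# Assumes the Hugging Face `transformers` library and a binary classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical large-vs-distilled pairs; checkpoint names are assumptions, not from the paper.
pairs = {
    "bert-base-uncased": "distilbert-base-uncased",
    "roberta-base": "distilroberta-base",
}

# Illustrative social-media-style input.
text = "just got the new phone and the battery life is unreal"

for large, distilled in pairs.items():
    for name in (large, distilled):
        tokenizer = AutoTokenizer.from_pretrained(name)
        # num_labels=2 assumes a binary task (e.g. sentiment); the classification
        # head is randomly initialized here and would normally be fine-tuned first.
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        n_params = sum(p.numel() for p in model.parameters())
        print(f"{name}: {n_params / 1e6:.1f}M parameters, logits={logits.tolist()}")
```

The parameter counts make the resource gap concrete (a distilled model typically has roughly half the parameters of its teacher), while task accuracy after fine-tuning is what the paper's evaluation ultimately compares.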
