Abstract

BERT is a pre-trained language model. Although the model has proven highly performant on a variety of natural language understanding tasks, its large size makes it hard to deploy in practical settings where computing resources are limited. To improve the efficiency of BERT for the sentiment analysis task, we propose a novel distilled version of BERT. It distills knowledge from the full-size BERT model, which serves as the teacher model. Unlike previous approaches, the distilled model learns from both the last hidden state and the soft labels of the teacher model. We use a distillation learning objective that effectively transfers knowledge from the original large model to the compact model. Our model reduces the BERT model size by ∼40% while retaining ∼98.2% of its performance on the sentiment classification task. It achieves promising results on SST-2 sentiment analysis and outperforms previous distilled models.
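To make the training objective concrete, the following is a minimal sketch of a combined distillation loss of the kind the abstract describes, in which the student matches the teacher's soft labels and its last hidden state. The temperature, the loss weights, the cosine loss for hidden-state alignment, and the inclusion of a hard-label cross-entropy term are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def distillation_loss(
    student_logits: torch.Tensor,   # (batch, num_classes)
    teacher_logits: torch.Tensor,   # (batch, num_classes)
    student_hidden: torch.Tensor,   # (batch, seq_len, hidden) last hidden state
    teacher_hidden: torch.Tensor,   # (batch, seq_len, hidden) last hidden state
    labels: torch.Tensor,           # (batch,) ground-truth sentiment labels
    temperature: float = 2.0,       # assumed value
    alpha_soft: float = 0.5,        # assumed loss weights
    alpha_hidden: float = 0.3,
    alpha_hard: float = 0.2,
) -> torch.Tensor:
    # Soft-label loss: KL divergence between temperature-scaled distributions
    # of the student and the teacher.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Last-hidden-state alignment: cosine loss over token representations
    # (assumes student and teacher share the same hidden size).
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    hidden_loss = F.cosine_embedding_loss(
        student_hidden.reshape(-1, student_hidden.size(-1)),
        teacher_hidden.reshape(-1, teacher_hidden.size(-1)),
        target,
    )

    # Optional supervised loss on the ground-truth labels (an assumption here).
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha_soft * soft_loss + alpha_hidden * hidden_loss + alpha_hard * hard_loss
```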
