Abstract

Language models based on the Transformer architecture now dominate natural language processing. Because these models tend to perform better as their parameter counts grow, model sizes and computational loads have increased sharply. ALBERT addresses the storage side of this problem by sharing parameters across layers, which greatly reduces the number of parameters the model retains. However, because each shared layer is still applied repeatedly, ALBERT's computational load remains similar to that of the original model. In this study, we develop a distillation system that decreases the number of times the ALBERT model reuses its shared parameters and progressively shrinks the parameters being reused. Within this distillation system, we propose a representation that effectively distills the knowledge of the original model, and we derive a new architecture with reduced computation. Using this system, F-ALBERT, with roughly half the computational load of ALBERT, recovered about 98% of the original model's performance on the GLUE benchmark.
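To make the reuse/compute trade-off concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a single shared transformer layer is applied `num_reuses` times, so the parameter count stays fixed while compute scales with the reuse count. The names `SharedEncoder` and `num_reuses`, and the MSE distillation objective on final hidden states, are illustrative assumptions; the paper's actual distillation targets and architecture may differ.

```python
# Sketch of ALBERT-style cross-layer parameter sharing and the effect of
# halving the number of reuse steps (hypothetical names, not the paper's code).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One transformer layer whose weights are applied `num_reuses` times."""

    def __init__(self, hidden: int = 768, heads: int = 12, num_reuses: int = 12):
        super().__init__()
        # A single layer's parameters are stored once and reused at every step,
        # so the parameter count is constant while compute grows with num_reuses.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True
        )
        self.num_reuses = num_reuses

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_reuses):
            x = self.layer(x)  # same weights every iteration
        return x

# Teacher: full reuse depth. Student: half the reuse steps, hence roughly
# half the forward-pass compute with the same number of stored parameters.
teacher = SharedEncoder(num_reuses=12)
student = SharedEncoder(num_reuses=6)

x = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
with torch.no_grad():
    t_out = teacher(x)
s_out = student(x)

# A simple distillation loss on final hidden states; one of many possible
# choices for transferring the teacher's knowledge to the shallower student.
loss = nn.functional.mse_loss(s_out, t_out)
loss.backward()
```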
