Abstract

Recent improvements in the predictive quality of natural language processing systems are often dependent on a substantial increase in the number of model parameters. This has led to various attempts at compressing such models, but existing methods have not considered the differences in the predictive power of various model components or in the generalizability of the compressed models. To understand the connection between model compression and out-of-distribution generalization, we define the task of compressing language representation models such that they perform best in a domain adaptation setting. We choose to address this problem from a causal perspective, attempting to estimate the average treatment effect (ATE) of a model component, such as a single layer, on the model’s predictions. Our proposed ATE-guided Model Compression scheme (AMoC) generates many model candidates, differing by the model components that were removed. Then, we select the best candidate through a stepwise regression model that utilizes the ATE to predict the expected performance on the target domain. AMoC outperforms strong baselines on dozens of domain pairs across three text classification and sequence tagging tasks.
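The sketch below illustrates the candidate-generation and selection idea described in the abstract; it is not the authors' implementation. It assumes a toy model built from stacked layer functions, approximates the ATE of a layer as the mean change in model outputs when that layer is removed, and replaces the paper's stepwise regression with ordinary least squares. The layer dimensions and dev-set scores are hypothetical.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    def forward(layers, x):
        """Apply a stack of layer functions to input x."""
        for layer in layers:
            x = layer(x)
        return x

    def ate_of_layer(layers, idx, inputs):
        """Approximate the ATE of layer idx as the mean absolute change in
        the model's outputs when that layer is removed (an intervention on
        the architecture)."""
        full = np.stack([forward(layers, x) for x in inputs])
        ablated_layers = layers[:idx] + layers[idx + 1:]
        ablated = np.stack([forward(ablated_layers, x) for x in inputs])
        return float(np.mean(np.abs(full - ablated)))

    # Toy "model": three dense layers with tanh non-linearities (hypothetical).
    dim = 8
    weights = [rng.normal(size=(dim, dim)) for _ in range(3)]
    layers = [lambda x, W=W: np.tanh(x @ W) for W in weights]
    source_inputs = [rng.normal(size=dim) for _ in range(64)]

    # Step 1: generate compressed candidates, one per removed layer,
    # and record the ATE of the removed component.
    candidates = [{"removed": idx, "ate": ate_of_layer(layers, idx, source_inputs)}
                  for idx in range(len(layers))]

    # Step 2 (simplified): fit a regression from candidate features (here, the
    # ATE alone) to performance observed on labeled domain pairs, then use it
    # to rank candidates. In the paper the regression is stepwise and is used
    # to predict performance on a new, unlabeled target domain.
    observed_perf = np.array([0.71, 0.69, 0.74])  # hypothetical dev-set scores
    X = np.array([[c["ate"]] for c in candidates])
    reg = LinearRegression().fit(X, observed_perf)
    best = max(candidates, key=lambda c: reg.predict(np.array([[c["ate"]]]))[0])
    print("Selected candidate removes layer", best["removed"])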

Highlights

  • The rise of deep neural networks (DNNs) has transformed the way we represent language, allowing models to learn useful features directly from raw inputs

  • Our approach addresses each of the three main challenges we identify, as it allows estimating the marginal effect of each model component, is designed and tested for out-of-distribution generalization, and provides estimates for each compressed model's performance on an unlabeled target domain

  • We focus on the Named Entity Recognition (NER) task with 6 different English domains: Broadcast Conversation (BC), Broadcast News (BN), Magazine (MZ), Newswire (NW), Telephone Conversation (TC) and Web data (WB)

Summary

Introduction

The rise of deep neural networks (DNNs) has transformed the way we represent language, allowing models to learn useful features directly from raw inputs. The introduction of the Transformer architecture (Vaswani et al., 2017) and attention-based models (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020) has improved performance on most natural language processing (NLP) tasks, while facilitating a large increase in model sizes. Since large models require a significant amount of computation and memory during training and inference, there is a growing demand for compressing such models while retaining the most relevant information. While recent attempts have shown promising results (Sanh et al., 2019), they have some limitations: they attempt to mimic the behavior of the larger models without trying to understand the information preserved or lost in the compression process.
