Abstract

Transformer-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always offer better performance across the different syntactic phenomena, and they come at a higher financial and environmental cost.
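
The targeted syntactic evaluation mentioned in the abstract comes down to checking whether a model prefers the grammatical member of a minimal pair. As a rough, simplified illustration of that idea (not the paper's exact protocol or test suites), the sketch below scores a subject-verb agreement pair with a masked language model using a pseudo-log-likelihood; the roberta-base checkpoint and the example sentences are placeholders standing in for the MiniBERTas models and the actual evaluation items.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

def pseudo_log_likelihood(model, tokenizer, sentence):
    """Score a sentence with a masked LM by masking one token at a time
    and summing the log-probability assigned to the original token."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total = 0.0
    for i in range(1, input_ids.size(0) - 1):          # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

# Placeholder checkpoint and minimal pair (assumptions, not the paper's items).
tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
mlm.eval()
grammatical = "The author that the guards like laughs."
ungrammatical = "The author that the guards like laugh."
print(pseudo_log_likelihood(mlm, tok, grammatical)
      > pseudo_log_likelihood(mlm, tok, ungrammatical))
```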

Highlights

  • We ask three questions: (i) Do models pretrained on more data encode more syntactic information? (ii) Do they offer better syntactic generalization? (iii) Do models with more pretraining perform better when applied to downstream tasks? To address these questions, we explore the relation between the size of the pretraining data and the syntactic capabilities of RoBERTa by means of the MiniBERTas models, a set of 12 RoBERTa models pretrained from scratch by Warstadt et al (2020b) on increasing quantities of raw text data

  • We use the syntactic structural probes from Hewitt and Manning (2019b) to determine whether models pretrained on more data encode a higher amount of syntactic information than those trained on less data (a minimal sketch of such a probe follows this list);

  • We explore the impact of the size of pretraining data on the syntactic information encoded by RoBERTa from three different angles
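
The structural probes referenced above (Hewitt and Manning, 2019b) learn a linear transformation under which squared L2 distances between contextual word vectors approximate pairwise distances in the dependency tree. Below is a minimal PyTorch sketch of such a distance probe; the hidden size, probe rank, and loss normalization are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DistanceProbe(nn.Module):
    """Structural (distance) probe in the spirit of Hewitt & Manning (2019):
    a linear map whose squared L2 distances approximate syntactic tree distances.
    hidden_dim and probe_rank are illustrative defaults."""
    def __init__(self, hidden_dim=768, probe_rank=128):
        super().__init__()
        self.proj = nn.Parameter(torch.randn(hidden_dim, probe_rank) * 0.01)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, hidden_dim) contextual word vectors
        transformed = embeddings @ self.proj                 # (batch, seq_len, rank)
        diffs = transformed.unsqueeze(2) - transformed.unsqueeze(1)
        return (diffs ** 2).sum(-1)                          # predicted squared distances

def probe_loss(pred_distances, gold_tree_distances, lengths):
    """L1 loss between predicted and gold pairwise tree distances,
    normalized by squared sentence length."""
    loss = 0.0
    for pred, gold, n in zip(pred_distances, gold_tree_distances, lengths):
        loss = loss + (pred[:n, :n] - gold[:n, :n]).abs().sum() / (n ** 2)
    return loss / len(lengths)
```

The probe is deliberately low-capacity (a single linear map), so a good fit is evidence that the tree structure is already encoded in the frozen representations rather than learned by the probe itself.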

Summary

Introduction

We explore the relation between the size of the pretraining data and the syntactic capabilities of RoBERTa by means of the MiniBERTas models, a set of 12 RoBERTa models pretrained from scratch by Warstadt et al (2020b) on increasing quantities of data. When applied to downstream tasks, the models pretrained on more data generally perform better. Previous probing work has analyzed the linguistic phenomena encoded by pretrained models (Conneau et al, 2018; Liu et al, 2019a; Tenney et al, 2019; Voita and Titov, 2020; Elazar et al, 2020), and it has been shown that entire syntax trees are embedded implicitly in BERT’s vector geometry (Hewitt and Manning, 2019b; Chi et al, 2020). Other works have criticized some probing methods, claiming that classifier probes can learn the linguistic task from the training data (Hewitt and Liang, 2019) and can fail to determine whether the detected features are actually used (Voita and Titov, 2020; Pimentel et al, 2020; Elazar et al, 2020).
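
As context for the experiments, the sketch below shows how one might load MiniBERTas checkpoints of increasing pretraining size and extract their hidden states for probing. The Hugging Face model identifiers are assumptions and should be checked against the official NYU-MLL release.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint names for MiniBERTas of increasing pretraining size;
# verify the exact identifiers on the Hugging Face hub before use.
CHECKPOINTS = [
    "nyu-mll/roberta-med-small-1M-1",   # ~1M words of pretraining data
    "nyu-mll/roberta-base-10M-1",       # ~10M words
    "nyu-mll/roberta-base-100M-1",      # ~100M words
    "nyu-mll/roberta-base-1B-1",        # ~1B words
]

sentence = "The keys to the cabinet are on the table."

for name in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)
    model.eval()
    with torch.no_grad():
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden_states = model(**inputs).hidden_states   # one tensor per layer
    print(name, "layers:", len(hidden_states),
          "hidden size:", hidden_states[-1].shape[-1])
```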

Costs of modern language models
Background
The MiniBERTas models
Agreement
Structural probing
Garden-Path Effects
Licensing
Long-Distance Dependencies
Results
Cost-benefit analysis
Discussion and conclusions