Abstract

Commonsense reasoning is one of the key problems in natural language processing, but the relative scarcity of labeled data holds back progress for languages other than English. Pretrained cross-lingual models are a source of powerful language-agnostic representations, yet their inherent reasoning capabilities are still actively studied. In this work, we design a simple approach to commonsense reasoning that trains a linear classifier with multi-head attention weights as features. To evaluate this approach, we create a multilingual Winograd Schema corpus by processing several datasets from prior work within a standardized pipeline and measure cross-lingual generalization ability in terms of out-of-sample performance. The method performs competitively with recent supervised and unsupervised approaches to commonsense reasoning, even when applied to other languages in a zero-shot manner. We also demonstrate that most of the performance is given by the same small subset of attention heads for all studied languages, which provides evidence of universal reasoning capabilities in multilingual encoders.
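
The core of this approach can be illustrated with a short sketch: for each Winograd-style sentence and candidate antecedent, compute one feature per attention head (for example, the average attention mass flowing from the pronoun's tokens to the candidate's tokens) and fit a linear classifier on these features. The code below is an illustrative reconstruction under stated assumptions, not the authors' exact pipeline: xlm-roberta-base as the encoder, mean pooling over token pairs, and a few toy Winograd-style examples as data.

```python
# Hedged sketch: attention weights of a multilingual MLM as features for a
# linear classifier on Winograd-style pronoun resolution. Model choice,
# feature pooling, and the toy data are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_attentions=True)
model.eval()

def attention_features(sentence, pronoun, candidate):
    """One scalar per (layer, head): mean attention from the pronoun's
    tokens to the candidate antecedent's tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.attentions: one (1, num_heads, seq, seq) tensor per layer
    attn = torch.cat(out.attentions, dim=1)[0]        # (layers * heads, seq, seq)

    def token_positions(span):
        span_ids = tokenizer(span, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(span_ids) + 1):
            if ids[i:i + len(span_ids)] == span_ids:
                return list(range(i, i + len(span_ids)))
        return []                                     # subword match not found

    p_pos, c_pos = token_positions(pronoun), token_positions(candidate)
    if not p_pos or not c_pos:
        return np.zeros(attn.shape[0])
    # Average attention mass from pronoun tokens to candidate tokens,
    # yielding one feature per attention head.
    return attn[:, p_pos][:, :, c_pos].mean(dim=(1, 2)).numpy()

# Toy Winograd-style data: (sentence, pronoun, candidate); label 1 means the
# candidate is the pronoun's true referent (placeholders, not a real corpus).
examples = [
    ("The trophy didn't fit in the suitcase because it was too big.", "it", "trophy"),
    ("The trophy didn't fit in the suitcase because it was too big.", "it", "suitcase"),
    ("The ball broke the table because it was made of steel.", "it", "ball"),
    ("The ball broke the table because it was made of steel.", "it", "table"),
]
labels = [1, 0, 1, 0]

X = np.stack([attention_features(s, p, c) for s, p, c in examples])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

At inference time, the classifier's scores for the competing candidates of a schema can be compared directly, and because the features come from a multilingual encoder, the same fitted classifier can be applied to sentences in other languages in a zero-shot manner.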

Highlights

  • We demonstrate that in cross-lingual models, there exists a small subset of attention heads specializing in universal commonsense reasoning (see the sketch after this list)

  • The quality of our method improves more significantly than that of the approach suggested by Kocijan et al. (2019): this may be explained by a greater parameter count and a larger number of attention heads with more distinct specializations

  • We offer a simple supervised method to utilize pretrained language models for commonsense reasoning
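
To make the "small subset of heads" claim concrete, one can inspect the linear classifier fitted in the earlier sketch: the magnitude of each coefficient indicates how strongly the corresponding (layer, head) pair contributes, and a classifier refit on only the top-ranked heads shows how much accuracy that subset retains. The continuation below reuses `clf`, `X`, and `labels` from the sketch above and assumes the 12-heads-per-layer geometry of xlm-roberta-base; it is illustrative, not the authors' exact selection procedure.

```python
# Continuation of the earlier sketch: rank (layer, head) pairs by coefficient
# magnitude. Assumes `clf`, `X`, and `labels` are in scope from that sketch;
# 12 heads per layer matches the assumed xlm-roberta-base checkpoint.
import numpy as np
from sklearn.linear_model import LogisticRegression

NUM_HEADS_PER_LAYER = 12

importance = np.abs(clf.coef_[0])            # one weight per (layer, head) feature
top = np.argsort(importance)[::-1][:10]      # indices of the 10 strongest heads
for rank, idx in enumerate(top, start=1):
    layer, head = divmod(int(idx), NUM_HEADS_PER_LAYER)
    print(f"{rank:2d}. layer {layer:2d}, head {head:2d}, |w| = {importance[idx]:.3f}")

# Refit on the selected heads alone to check how much performance the subset keeps.
clf_top = LogisticRegression(max_iter=1000).fit(X[:, top], labels)
```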

Summary

Introduction

Neural networks have achieved remarkable progress in numerous tasks involving natural language, such as machine translation (Bahdanau et al., 2014; Kaplan et al., 2020; Arivazhagan et al., 2019), language modeling (Brown et al., 2020), open-domain dialog systems (Adiwardana et al., 2020; Roller et al., 2020), and general-purpose language understanding (Devlin et al., 2019; He et al., 2021). Large Transformer-based masked language models (MLMs) (Devlin et al., 2019) were shown to achieve impressive results on several benchmark datasets for commonsense reasoning (Sakaguchi et al., 2020; Kocijan et al., 2019; Klein and Nabi, 2020). The best-performing methods frequently involve finetuning the entire model on sufficiently large corpora with varying degrees of supervision; apart from providing initial parameter values, the pretrained language model itself is not used for predictions.
