Abstract

We introduce a theoretical framework for understanding and predicting the complexity of sequence classification tasks, using a novel extension of the theory of Boolean function sensitivity. The sensitivity of a function, given a distribution over input sequences, quantifies the number of disjoint subsets of the input sequence that can each be individually changed to change the output. We argue that standard sequence classification methods are biased towards learning low-sensitivity functions, so that tasks requiring high sensitivity are more difficult. To that end, we show analytically that simple lexical classifiers can only express functions of bounded sensitivity, and we show empirically that low-sensitivity functions are easier for LSTMs to learn. We then estimate sensitivity on 15 NLP tasks, finding that sensitivity is higher on challenging tasks collected in GLUE than on simple text classification tasks, and that sensitivity predicts the performance both of simple lexical classifiers and of vanilla BiLSTMs without pretrained contextualized embeddings. Within a task, sensitivity predicts which inputs are hard for such simple models. Our results suggest that the success of massively pretrained contextual representations stems in part from the fact that they provide representations from which information can be extracted by low-sensitivity decoders.
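
To make the verbal definition above concrete, here is a minimal Python sketch of the classical block-sensitivity notion for Boolean functions that the paper builds on: the largest number of disjoint parts of the input that can each, on their own, be changed to flip the output. This is only an illustration under that classical definition; the paper's measure generalizes it to distributions over natural-language token sequences, and the names used here (block_sensitivity, flip) are illustrative rather than taken from the paper's code.

```python
from itertools import combinations

def block_sensitivity(f, x):
    """Brute-force classical block sensitivity of a Boolean function f at x:
    the largest number of pairwise disjoint index blocks such that flipping
    the bits inside any single block changes f's output."""
    n = len(x)

    def flip(bits, block):
        return tuple(1 - b if i in block else b for i, b in enumerate(bits))

    # Every non-empty block whose flip, applied alone, changes the output.
    sensitive = [
        frozenset(block)
        for size in range(1, n + 1)
        for block in combinations(range(n), size)
        if f(flip(x, block)) != f(x)
    ]

    # Exhaustively find the largest collection of pairwise disjoint blocks.
    best = 0

    def search(candidates, chosen):
        nonlocal best
        best = max(best, len(chosen))
        for i, b in enumerate(candidates):
            if all(b.isdisjoint(c) for c in chosen):
                search(candidates[i + 1:], chosen + [b])

    search(sensitive, [])
    return best

# Parity (XOR) is maximally sensitive: flipping any single bit flips the
# label, so three disjoint singleton blocks exist at every input.
xor3 = lambda bits: bits[0] ^ bits[1] ^ bits[2]
# OR is low-sensitivity at (1, 1, 1): only flipping all three bits changes it.
or3 = lambda bits: int(any(bits))

print(block_sensitivity(xor3, (0, 0, 0)))  # 3
print(block_sensitivity(or3, (1, 1, 1)))   # 1
```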

Highlights

  • What makes some tasks harder and others easier for modern machine learning methods? In NLP, simple models based on lexical classifiers provide good performance on some tasks, while strong performance on other tasks has been attained only recently with massive pretrained models.

  • We propose sensitivity as a theory of complexity for sequence classification tasks, that is, any task involving learning a function from sequences to labels.

  • In a survey of 15 major NLP tasks, we find that sensitivity quantitatively predicts how difficult a task is for simple lexical classifiers and neural models, both across tasks and across different inputs for a single task (Section 4).

Introduction

What makes some tasks harder and others easier for modern machine learning methods? In NLP, simple models based on lexical classifiers provide good performance on some tasks, while strong performance on other tasks has been attained only recently with massive pretrained models. Existing complexity metrics provide limited practical insight. The Chomsky Hierarchy (Chomsky, 1956) is a prominent classification of formal languages by complexity, but it describes asymptotic worst-case complexity and does not provide a measure of how hard it is to achieve high accuracy on realistic task distributions. Kolmogorov complexity (Li and Vitányi, 1993) is uncomputable and becomes well-defined only in the asymptotic limit. Psycholinguistic complexity metrics such as surprisal (Hale, 2001) and dependency length (Gibson, 1998) only capture formal features of the input, without regard to the task.
