Abstract

Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance on the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear which specific concepts the trained systems learn and where they achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TaxiNLI, a new dataset of 10k examples drawn from the MNLI dataset and annotated with these taxonomic labels. Through various experiments on TaxiNLI, we observe that state-of-the-art neural models achieve near-perfect accuracy on certain taxonomic categories, a large jump over previous models, while other categories remain difficult. Our work adds to the growing body of literature that exposes gaps in current NLI systems and datasets through a systematic presentation and analysis of reasoning categories.

Highlights

  • The Natural Language Inference (NLI) task tests whether a textual hypothesis (H) contradicts, is entailed by, or is neutral with respect to a given premise (P).

  • This 3-way classification task, popularized by Bowman et al. (2015) and in turn inspired by Dagan et al. (2005), serves as a benchmark for evaluating the natural language understanding (NLU) capability of models; for example, NLI datasets (Bowman et al., 2015; Williams et al., 2018) are included in NLU benchmarks such as GLUE and SuperGLUE (Wang et al., 2018).

  • We propose a taxonomy of the various reasoning tasks that are commonly covered by current NLI datasets (Section 2).

Summary

Introduction

The Natural Language Inference (NLI) task tests whether a textual hypothesis (H) contradicts, is entailed by, or is neutral with respect to a given premise (P). This 3-way classification task, popularized by Bowman et al. (2015) and in turn inspired by Dagan et al. (2005), serves as a benchmark for evaluating the natural language understanding (NLU) capability of models; for example, NLI datasets (Bowman et al., 2015; Williams et al., 2018) are included in NLU benchmarks such as GLUE and SuperGLUE (Wang et al., 2018). Although models have shown steady performance gains on NLI tasks, several authors (Nie et al., 2019; Kaushik et al., 2019) demonstrate steep drops in performance when these models are tested on adversarially (or counterfactually) created examples written by non-experts. Richardson et al. (2019) use templated examples to show that trained NLI systems fail to capture essential logical (negation, boolean, quantifier) and semantic (monotonicity) phenomena.
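To make the task setup concrete, the sketch below runs a pretrained Transformer-based NLI classifier on a single premise–hypothesis pair. This is not the authors' code: the checkpoint name (`roberta-large-mnli`) and the label ordering are assumptions based on a publicly available MNLI-finetuned model from the Hugging Face `transformers` library.

```python
# Minimal sketch of 3-way NLI classification with a pretrained Transformer.
# Assumes the public MNLI-finetuned checkpoint "roberta-large-mnli"; the label
# ordering below is taken from that model's configuration, not from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

premise = "A soccer game with multiple males playing."  # P
hypothesis = "Some men are playing a sport."            # H

# Premise and hypothesis are encoded together as a single sentence pair.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# For this checkpoint: index 0 = contradiction, 1 = neutral, 2 = entailment.
labels = ["contradiction", "neutral", "entailment"]
probs = logits.softmax(dim=-1).squeeze().tolist()
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

For this pair, such a model should assign most of the probability mass to the entailment label, illustrating the 3-way decision the NLI benchmarks evaluate.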
