Abstract

While the attention heatmaps produced by neural machine translation (NMT) models seem insightful, there is little evidence that they reflect a model's true internal reasoning. We provide a measure of faithfulness for NMT based on a variety of stress tests in which attention weights that are crucial for a prediction are perturbed; if the learned weights are a faithful explanation of the predictions, the model should alter its predictions under these perturbations. We show that this faithfulness measure can be improved using a novel differentiable objective that rewards faithful model behaviour through probability divergence. Our experimental results on multiple language pairs show that the objective is effective in increasing faithfulness and leads to a useful analysis of NMT model behaviour and more trustworthy attention heatmaps. The proposed objective improves faithfulness without reducing translation quality; it also has a useful regularization effect on the NMT model and can even improve translation quality in some cases.
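
The abstract leaves the exact form of the objective unspecified; as a minimal illustrative sketch (not necessarily the authors' formulation), a divergence-based reward could be combined with the standard translation loss as follows, where $p_\theta(\cdot \mid y_{<t}, x)$ is the model's output distribution at decoding step $t$, $\tilde{p}_\theta$ is the distribution obtained after perturbing the attention weights crucial to that step, and $\lambda$ is a hypothetical weighting hyperparameter:

$$\mathcal{L} \;=\; -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x)\;-\;\lambda \sum_{t} D_{\mathrm{KL}}\!\big(p_\theta(\cdot \mid y_{<t}, x)\,\big\Vert\,\tilde{p}_\theta(\cdot \mid y_{<t}, x)\big)$$

Minimizing such a loss both fits the training data and rewards the model when perturbing the crucial attention weights substantially changes its output distribution, i.e., when attention behaves faithfully.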

Highlights

  • How trustworthy are our neural models? This question has led to a wide variety of contemporary NLP research focusing on (a) different axes of interpretability including plausibility (Herman, 2017; Lage et al., 2019) and faithfulness (Lipton, 2018; Jacovi and Goldberg, 2020b), (b) interpretation of the neural model components (Belinkov et al., 2017; Dalvi et al., 2017; Vig and Belinkov, 2019), (c) explaining the decisions made by neural models to humans (Ribeiro et al., 2016; Li et al., 2016; Ding et al., 2017; Ghaeini et al., 2018; Bastings et al., 2019; Jain et al., 2020), and (d) evaluating different explanation methods, such as attention weights, from different perspectives

  • Our findings show that our objective is effective in increasing faithfulness and can lead to a useful analysis of neural machine translation (NMT) model behaviour and more trustworthy attention heatmaps

  • We introduce a novel learning objective based on probability divergence that rewards faithful behaviour and which can be included in the training objective for NMT


Summary

Introduction

How trustworthy are our neural models? This question has led to a wide variety of contemporary NLP research focusing on (a) different axes of interpretability including plausibility (or, interchangeably, human-interpretability) (Herman, 2017; Lage et al., 2019) and faithfulness (Lipton, 2018; Jacovi and Goldberg, 2020b), (b) interpretation of the neural model components (Belinkov et al., 2017; Dalvi et al., 2017; Vig and Belinkov, 2019), (c) explaining the decisions made by neural models to humans (using explanations, highlights, rationales, etc.) (Ribeiro et al., 2016; Li et al., 2016; Ding et al., 2017; Ghaeini et al., 2018; Bastings et al., 2019; Jain et al., 2020), and (d) evaluating different explanation methods, such as attention weights, from different perspectives.

We focus on faithfulness, which intuitively captures the extent to which an explanation accurately represents the true reasoning behind a prediction. It is important for NLP practitioners who wish to debug their neural models and improve them. Jacovi and Goldberg (2020b) emphasize distinguishing faithfulness from human-interpretability in interpretability research by providing several clarifications about the terminology used by researchers. Jacovi and Goldberg (2020a) propose criteria for how faithfulness should be evaluated. Aligned with these criteria, we study the faithfulness of NLP neural models, specifically NMT models.

We provide a faithfulness measure that is computed based on a variety of stress tests in which attention weights that are crucial for prediction are perturbed. We propose a novel differentiable objective based on probability divergence and study its effect on the discrete faithfulness measure. We expect larger, overparameterized models to become less faithful because the language model in the decoder gets better at guessing the next word on its own, which, as we shall discuss in more detail later, tends to make attention less faithful.
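
To make the idea of such a stress test concrete, the following is a minimal sketch under assumed interfaces, not the paper's exact protocol: the `decode_step` method and its `forced_attention` argument are hypothetical placeholders for however a given NMT implementation exposes per-step logits and attention. It zeroes out the most-attended source position at each decoding step and checks whether the model's prediction changes.

```python
import torch


def attention_stress_test(model, src: torch.Tensor, tgt: torch.Tensor) -> float:
    """Illustrative faithfulness stress test (a sketch): zero out the
    most-attended source position at each decoding step and count how
    often the model's next-word prediction changes.

    Assumes a single (unbatched) sentence pair and a hypothetical
    `model.decode_step(src, prefix, forced_attention=None)` API returning
    next-word logits of shape (vocab,) and attention weights of shape (src_len,).
    """
    changed, total = 0, 0
    for t in range(1, tgt.size(0)):
        prefix = tgt[:t]

        # Original prediction and the attention weights it was based on.
        logits, attn = model.decode_step(src, prefix)
        original_pred = logits.argmax(dim=-1).item()

        # Perturbation: remove the single most-attended source position
        # and renormalize the remaining weights.
        perturbed = attn.clone()
        perturbed[attn.argmax()] = 0.0
        perturbed = perturbed / perturbed.sum()

        # Re-decode with the perturbed attention forced into the decoder.
        new_logits, _ = model.decode_step(src, prefix, forced_attention=perturbed)
        changed += int(new_logits.argmax(dim=-1).item() != original_pred)
        total += 1

    # A faithful model should change its prediction when the attention it
    # claims to rely on is taken away, so higher values mean more faithful.
    return changed / max(total, 1)
```

The returned fraction can then be averaged over a test set to obtain a single score; a model whose predictions rarely change under such perturbations is relying on something other than the attention weights it reports.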

Faithfulness in NMT Models
Approach
Divergence-based Faithfulness Objective
On Attention Sparsity
Datasets
Architecture and Hyperparameters
Training Difficulties
Impact on Faithfulness
POS-tag Analysis
Effect of Training With Single Adversary on Passing Other Stress Tests
Regularization Effect
Objective
Do the New Models Have Sparser Attention?
Conclusion