Abstract
Recent work has shown that the encoder-decoder attention mechanisms in neural machine translation (NMT) differ from word alignment in statistical machine translation. In this paper, we analyze encoder-decoder attention mechanisms in the context of word sense disambiguation (WSD) in NMT models. We hypothesize that attention mechanisms pay more attention to context tokens when translating ambiguous words, and we explore the attention distribution patterns when translating ambiguous nouns. Counterintuitively, we find that attention mechanisms tend to distribute more attention to the ambiguous noun itself rather than to context tokens, in comparison to other nouns. We conclude that attention is not the main mechanism used by NMT models to incorporate contextual information for WSD. The experimental results suggest that NMT models learn to encode the contextual information necessary for WSD in the encoder hidden states. For the attention mechanism in Transformer models, we show that the first few layers gradually learn to “align” source and target tokens, while the last few layers learn to extract features from related but unaligned context tokens.
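To make the kind of analysis described above concrete, the following minimal sketch (not the authors' code; the array shapes and the names attention, tgt_step and src_noun_pos are assumptions for illustration) measures how much encoder-decoder attention mass falls on the source noun itself versus on its context tokens at the decoding step that produces its translation.

```python
# Sketch: split one decoding step's attention row into "noun itself" vs "context".
# attention is assumed to be a [tgt_len, src_len] matrix whose rows sum to 1.
import numpy as np

def attention_split(attention: np.ndarray, tgt_step: int, src_noun_pos: int):
    row = attention[tgt_step]
    noun_mass = row[src_noun_pos]         # attention paid to the noun itself
    context_mass = row.sum() - noun_mass  # attention paid to all other source tokens
    return noun_mass, context_mass

# Averaging noun_mass over ambiguous nouns and over other nouns would expose the
# counterintuitive pattern reported above: ambiguous nouns tend to receive more
# attention themselves, not less.
```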
Highlights
Human languages exhibit many different types of ambiguity
We conclude that encoder-decoder attention is not the main mechanism used by neural machine translation (NMT) models to incorporate contextual information for word sense disambiguation (WSD)
We assume that the contextual information has already been encoded into the hidden states by the encoder, and attention mechanisms do not learn which source words are useful for WSD
Summary
Human languages exhibit many different types of ambiguity. Lexical ambiguity refers to the fact that words can have more than one semantic meaning. We focus on how encoder-decoder attention mechanisms deal with ambiguous nouns. In this setting, we expect to get a more accurate picture of the WSD performance of NMT models. We hypothesize that attention mechanisms distribute more attention to context tokens when translating ambiguous words, and we explore the relation between translation accuracy and attention distributions when translating ambiguous nouns. We conclude that encoder-decoder attention is not the main mechanism used by NMT models to incorporate contextual information for WSD. Instead, it learns to capture features from the related but unaligned source context tokens.
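The layer-wise claim above, that early layers behave like alignment while later layers spread over unaligned context, could be probed with a simple entropy measurement. The sketch below is illustrative only and assumes a hypothetical array layer_attn of shape [num_layers, num_heads, tgt_len, src_len] holding the Transformer's encoder-decoder attention weights.

```python
# Sketch: mean attention entropy per layer. Lower entropy suggests alignment-like
# concentration on a single source token; higher entropy suggests attention spread
# over related but unaligned context tokens.
import numpy as np

def layer_attention_entropy(layer_attn: np.ndarray) -> np.ndarray:
    eps = 1e-12
    # Entropy over source positions, per layer/head/target step: [layers, heads, tgt_len]
    entropy = -(layer_attn * np.log(layer_attn + eps)).sum(axis=-1)
    # Average over heads and target steps to get one value per layer: [layers]
    return entropy.mean(axis=(1, 2))
```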