Sequential learning using transformers has achieved state-of-the-art performance in natural language tasks and many others. The key to this success is the multi-head self-attention, which encodes and gathers features from the individual tokens of an input sequence. The mapping or decoding to an output sequence is then performed via cross-attention. However, such an attention framework has three weaknesses. First, since attention mixes up the features of different tokens in the input and output sequences, redundant information is likely to exist in the sequence representation. Second, the patterns of attention weights tend to be similar across heads, so the model capacity is bounded. Third, the robustness of the encoder-decoder network against model uncertainty is disregarded. To address these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention that learns latent disentanglement in multi-head attention, where the redundant features in the transformer are compensated with latent topic information. The attention weights are filtered by a mask that is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. Experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.
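To make the idea of filtering attention weights with a mask concrete, the following is a minimal sketch, not the authors' implementation: a multi-head attention module in which a content-dependent, per-head gate attenuates the attention weights over key positions before they are renormalized. The module name `MaskedMultiHeadAttention`, the sigmoid gate, and the linear `mask_proj` are illustrative assumptions; the paper's Bayesian optimization of the mask through semantic clustering is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadAttention(nn.Module):
    """Multi-head attention whose weights are filtered by a learned mask (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Content-dependent gate per head and key position; in the paper this
        # mask would instead be optimized through Bayesian semantic clustering.
        self.mask_proj = nn.Linear(d_model, n_heads)

    def forward(self, query, key, value):
        B, Tq, _ = query.shape
        Tk = key.shape[1]

        def split(x, T):
            # (B, T, d_model) -> (B, n_heads, T, d_head)
            return x.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q = split(self.q_proj(query), Tq)
        k = split(self.k_proj(key), Tk)
        v = split(self.v_proj(value), Tk)

        # Scaled dot-product attention weights: (B, n_heads, Tq, Tk)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)

        # Mask in (0, 1) for each head and key position: (B, n_heads, 1, Tk)
        gate = torch.sigmoid(self.mask_proj(key)).transpose(1, 2).unsqueeze(2)

        # Filter the attention weights and renormalize over key positions.
        attn = attn * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)

        out = torch.matmul(attn, v).transpose(1, 2).reshape(B, Tq, -1)
        return self.out_proj(out)

# Usage: self-attention on a batch of 2 sequences of length 5 with d_model=64.
x = torch.randn(2, 5, 64)
layer = MaskedMultiHeadAttention(d_model=64, n_heads=8)
y = layer(x, x, x)  # (2, 5, 64)
```

Because the gate varies across key positions, each head can suppress semantically redundant tokens rather than merely rescaling its whole attention map, which is the role the mask plays in the proposed clustered disentanglement.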