Abstractive summarization methods typically follow the autoregressive paradigm, using causal masks in the decoder for training and inference efficiency. However, this approach results in a fixed context throughout the generation process, which conflicts with the bidirectional characteristics of natural language. Although previous work has incorporated bidirectional attention into decoding through non-autoregressive approaches, their evaluation results do not match those of autoregressive methods. To bring bidirectional attention to the autoregressive process while maintaining superior performance, we propose the global autoregressive paradigm, which feeds the outputs of the autoregressive process back as additional inputs in a subsequent global iteration. Specifically, we build a bidirectional decoder alongside the original encoder and decoder to capture the bidirectional context of the outputs. This context is updated after each autoregressive decoding iteration, and the decoder integrates the updated context into subsequent autoregressive decoding steps, enhancing the generative process with a more comprehensive and authentic context. Additionally, we use contrastive learning to train the model to extract reliable features from the bidirectional context and apply reinforcement learning to improve the model's utilization of this context. We evaluate our method on the CNN/DM, XSum, and NYT datasets, and the results highlight the significance of the bidirectional context. Our method achieves the best ROUGE-2 on CNN/DM (23.96) and performs comparably on XSum (25.45) and NYT (27.91). It also outperforms all baselines in terms of BERTScore, scoring 89.96 on CNN/DM, 92.70 on XSum, and 90.04 on NYT. Furthermore, our method performs better with a larger beam size.