Abstract

Pointer Generators have been the de facto standard for modern summarization systems. However, this architecture has two major drawbacks: Firstly, the pointer is limited to copying exact words and ignores possible inflections or abstractions, which restricts its ability to capture richer latent alignments. Secondly, the copy mechanism results in a strong bias towards extractive generation, where most sentences are produced by simply copying from the source text. In this paper, we address these problems by allowing the model to “edit” pointed tokens instead of always hard-copying them. The editing is performed by transforming the pointed word vector into a target space with a learned relation embedding. On three large-scale summarization datasets, we show the model is able to (1) capture more latent alignment relations than exact word matches, (2) improve word-alignment accuracy, allowing for better model interpretation and control, (3) generate higher-quality summaries, as validated by both qualitative and quantitative evaluations, and (4) bring more abstraction to the generated summaries.
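To make the editing idea concrete, below is a minimal numpy sketch of one plausible reading of this mechanism: the pointed word's encoder vector is mapped into the target embedding space and shifted by a relation embedding derived from the decoder state. All names, dimensions, and weight matrices (W_edit, W_rel, E_out) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative "edit" of a pointed source token: instead of hard-copying the
# word, transform its encoder vector into the target embedding space and add
# a relation embedding computed from the decoder state. All weights below are
# random stand-ins for learned parameters.
rng = np.random.default_rng(0)
d_hid, d_emb, vocab = 8, 6, 20

W_edit = rng.normal(size=(d_emb, d_hid))   # encoder space -> target embedding space
W_rel = rng.normal(size=(d_emb, d_hid))    # decoder state -> relation embedding
E_out = rng.normal(size=(vocab, d_emb))    # output word embeddings

h_pointed = rng.normal(size=d_hid)         # encoder vector of the pointed source token
d_t = rng.normal(size=d_hid)               # current decoder hidden state

relation = np.tanh(W_rel @ d_t)            # learned relation embedding (sketch)
y_star = W_edit @ h_pointed + relation     # "edited" target vector

# Words whose output embedding has a higher inner product with y_star
# receive a higher probability under the point mode.
p_point = softmax(E_out @ y_star)
print(p_point.argmax(), p_point.max())
```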

Highlights

  • Modern state-of-the-art (SOTA) summarization models are built upon the pointer generator architecture (See et al., 2017)

  • We propose the Generalized Pointer Generator (GPG), which replaces the hard copy component with a more general soft “editing” function

  • We find the GPG model uses the point mode more frequently than standard pointer generators, especially on the Gigaword dataset (40% more)


Summary

Introduction

Modern state-of-the-art (SOTA) summarization models are built upon the pointer generator architecture (See et al., 2017). The model generates a sentinel to decide whether to sample words from the vocabulary based on the neural attention (generation mode) or to copy directly from an aligned source context (point mode) (Gu et al., 2016; Merity et al., 2017; Yang et al., 2017). Though it outperforms vanilla attention models, the pointer generator only captures exact word matches; inflected or otherwise rephrased variants of source words are not covered by the point mode. In a seq2seq model, each source token $x_i$ is encoded into a vector $h_i$. At each decoding step $t$, the decoder computes an attention distribution $a_t$ over the encoded vectors based on the current hidden state $d_t$ (Bahdanau et al., 2015): $a_t = \mathrm{softmax}(f(h_i, d_t))$ (1). A target vector $y_t^*$ is then predicted, and words whose embeddings have a higher inner product with $y_t^*$ receive a higher generation probability.
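As a rough illustration of the pieces described above (not the authors' code), the sketch below computes the attention distribution of Eq. (1) with an assumed additive scoring function $f$, predicts a target vector $y_t^*$, scores words by inner product, and mixes generation and point modes with a sentinel-style gate. All weights, dimensions, and the source-to-vocabulary mapping are toy stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
src_len, d_hid, d_emb, vocab = 5, 8, 6, 20

H = rng.normal(size=(src_len, d_hid))      # encoded source vectors h_i
d_t = rng.normal(size=d_hid)               # decoder hidden state at step t

# Additive (Bahdanau-style) scoring: f(h_i, d_t) = v^T tanh(W_h h_i + W_d d_t)
W_h = rng.normal(size=(d_hid, d_hid))
W_d = rng.normal(size=(d_hid, d_hid))
v = rng.normal(size=d_hid)
scores = np.tanh(H @ W_h.T + W_d @ d_t) @ v
a_t = softmax(scores)                      # attention distribution over source tokens (Eq. 1)

# Generation mode: predict a target vector y_t* from the decoder state and the
# attention context; score output words by their inner product with y_t*.
context = a_t @ H
W_y = rng.normal(size=(d_emb, 2 * d_hid))
E_out = rng.normal(size=(vocab, d_emb))
y_star = W_y @ np.concatenate([d_t, context])
p_gen_vocab = softmax(E_out @ y_star)

# Sentinel-style gate: mix generation and point modes. The mapping from
# source positions to vocabulary ids is a toy stand-in.
src_ids = rng.integers(0, vocab, size=src_len)
p_point_vocab = np.zeros(vocab)
np.add.at(p_point_vocab, src_ids, a_t)     # scatter attention mass onto copied word ids
gate = 1.0 / (1.0 + np.exp(-(v @ d_t)))    # scalar switch between the two modes
p_final = gate * p_gen_vocab + (1 - gate) * p_point_vocab
print(p_final.sum())                       # ~1.0: a valid distribution over the vocabulary
```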

