Abstract
Recent pre-trained abstractive summarization systems have started to achieve credible performance, but a major barrier to their use in practice is their propensity to output summaries that are not faithful to the input and that contain factual errors. While a number of annotated datasets and statistical models for assessing factuality have been explored, there is no clear picture of what errors are most important to target or where current techniques are succeeding and failing. We explore both synthetic and human-labeled data sources for training models to identify factual errors in summarization, and study factuality at the word-, dependency-, and sentence-level. Our observations are threefold. First, exhibited factual errors differ significantly across datasets, and commonly-used training sets of simple synthetic errors do not reflect errors made on abstractive datasets like XSum. Second, human-labeled data with fine-grained annotations provides a more effective training signal than sentence-level annotations or synthetic data. Finally, we show that our best factuality detection model enables training of more factual XSum summarization models by allowing us to identify non-factual tokens in the training data.
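To make the three annotation granularities concrete, the sketch below shows one way fine-grained factuality labels for a generated summary could be represented. This is a minimal illustration, not the paper's released data format: the class name, fields, and the example sentence are assumptions made for exposition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FactualityAnnotation:
    """Fine-grained factuality labels for one generated summary (illustrative)."""
    summary_tokens: List[str]   # tokenized summary
    token_labels: List[int]     # word level: 1 = supported by the source, 0 = not
    # dependency level: (head token index, child token index, relation, label)
    arc_labels: List[Tuple[int, int, str, int]] = field(default_factory=list)

    @property
    def sentence_label(self) -> int:
        """Sentence level: the summary counts as factual only if every token is."""
        return int(all(self.token_labels))

# Hypothetical example: the subject entity is wrong, so its tokens are non-factual.
ann = FactualityAnnotation(
    summary_tokens=["Sajid", "Javid", "visited", "the", "factory", "."],
    token_labels=[0, 0, 1, 1, 1, 1],
    arc_labels=[(2, 1, "nsubj", 0), (2, 4, "obj", 1)],
)
print(ann.sentence_label)  # -> 0: the summary as a whole is non-factual
```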
Highlights
In this paper, we aim to answer two main questions.
Human-labeled data with fine-grained annotations provides a more effective training signal than sentence-level annotations or synthetic data.
Does commonly-used synthetic training data reflect the errors made by generation models? We find the answer is no: techniques using surface-level data corruption (Kryscinski et al., 2020; Zhao et al., 2020; Cao et al., 2020) or paraphrasing (Goyal and Durrett, 2020a) target inherently different error distributions than those seen in actual model generations, and factuality models trained on these datasets perform poorly in practice.
We show that our best factuality detection model enables training of more factual XSum summarization models by allowing us to identify non-factual tokens in the training data (see the sketch below).
We also show that different summarization domains, CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) and XSum (Narayan et al., 2018), exhibit substantially different error distributions in their generated summaries.
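The last highlight can be made concrete with a small sketch: given token-level factuality predictions over the reference summaries (from some detector), the loss on tokens flagged as unsupported is masked out when fine-tuning the summarizer. This is a minimal PyTorch sketch under that assumption, not the paper's exact training recipe; the function name and masking scheme are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_summarization_loss(logits, target_ids, token_factual_mask, pad_id=0):
    """Cross-entropy over the reference summary, ignoring tokens that a
    factuality detector has flagged as unsupported by the source article.

    logits:             (batch, seq_len, vocab)  decoder outputs
    target_ids:         (batch, seq_len)         reference summary token ids
    token_factual_mask: (batch, seq_len)         1 = keep token, 0 = drop from loss
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,
        reduction="none",
    ).reshape(target_ids.shape)
    # Count only non-pad tokens that the detector considers factual.
    mask = token_factual_mask.float() * (target_ids != pad_id).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

In this sketch, flagged tokens simply contribute nothing to the gradient, so the summarizer is no longer rewarded for reproducing unsupported content from noisy references.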
Summary
XSum consists of British Broadcasting Corporation (BBC) articles, where the first sentence of the article is treated as a summary of the rest of the article. Models trained on this dataset have to learn to model long-range dependencies and may still be unable to recover all information in the gold summary.

We call this set of approaches entity-centric, because the transformations largely focus on perturbing entities. The approach from Kryscinski et al. (2020) has the broadest set of transformations out of this line of prior work.

In addition to sentence-level annotations, the paraphrase-based approach extracts factuality labels corresponding to each dependency arc of the generated summary. To adapt this data creation approach for our current experimental setting, we generated paraphrases of gold summaries using the paraphrase generation model of Goyal and Durrett (2020b). We generate 40k training examples for both the CNN/DM and XSum domains.
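As a concrete illustration of the entity-centric transformations described above, the sketch below corrupts a gold summary by swapping one of its named entities for a different entity of the same type from the source article, yielding a synthetic non-factual training example. It uses spaCy for entity recognition and is a simplified stand-in for the fuller transformation set of Kryscinski et al. (2020), not their actual pipeline; the example texts are invented.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with NER

def entity_swap(source: str, summary: str):
    """Return (summary, label): label 0 marks a synthetic factual error.

    Picks an entity in the summary and replaces it with a different entity of
    the same type found in the source article. If no valid swap exists, the
    original summary is returned with label 1 (treated as factual).
    """
    sum_ents = list(nlp(summary).ents)
    src_ents = list(nlp(source).ents)
    random.shuffle(sum_ents)
    for ent in sum_ents:
        candidates = [e.text for e in src_ents
                      if e.label_ == ent.label_ and e.text != ent.text]
        if candidates:
            corrupted = summary.replace(ent.text, random.choice(candidates), 1)
            return corrupted, 0   # non-factual synthetic example
    return summary, 1

source = ("Theresa May visited a car factory in Sunderland on Tuesday, "
          "while Jeremy Corbyn campaigned in Leeds.")
gold = "Theresa May visited a factory in Sunderland."
print(entity_swap(source, gold))
```

Because the corruption only edits surface entities, the resulting errors are easy to spot from local context, which is one reason models trained on such data transfer poorly to the errors made by real abstractive systems.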