Abstract

It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models lead to less human-like responses for open-ended tasks such as language modeling and story generation. In this paper we have analyzed limitations of these models for abstractive document summarization and found that these models are highly prone to hallucinate content that is unfaithful to the input document. We conducted a large-scale human evaluation of several neural abstractive summarization systems to better understand the types of hallucinations they produce. Our human annotators found substantial amounts of hallucinated content in all model-generated summaries. However, our analysis does show that pretrained models are better summarizers not only in terms of raw metrics, i.e., ROUGE, but also in generating faithful and factual summaries as evaluated by humans. Furthermore, we show that textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.

Highlights

  • Current state-of-the-art conditional text generation models accomplish a high level of fluency and coherence, mostly thanks to advances in sequence-to-sequence architectures with attention and copy (Sutskever et al., 2014; Bahdanau et al., 2015; Gu et al., 2016), fully attention-based Transformer architectures (Vaswani et al., 2017; Dai et al., 2019) and, more recently, pretrained language modeling for natural language understanding (Devlin et al., 2019; Radford et al., 2018; Yang et al., 2019; Liu et al., 2019)

  • ROUGE (Lin and Hovy, 2003) and BERTScore (Zhang et al., 2020) correlate less with faithfulness/factuality than metrics derived from automatic semantic inference systems, i.e., the degree to which a summary is entailed by the source document

  • We focus on the recently introduced extreme summarization dataset (XSUM; Narayan et al., 2018a), which comprises 226,711 British Broadcasting Corporation (BBC) articles paired with their single-sentence summaries, provided by the journalists writing the articles; a brief loading sketch follows this list
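
As a concrete illustration (not part of the original paper), the dataset can be inspected with the Hugging Face `datasets` library. The dataset id `EdinburghNLP/xsum` and the field names below are assumptions based on the public mirror, not something the paper specifies.

```python
# A minimal sketch of loading XSUM, assuming the public Hugging Face
# mirror "EdinburghNLP/xsum"; the id and field names are assumptions.
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum", split="test")
print(len(xsum))                  # number of test articles
example = xsum[0]
print(example["document"][:300])  # BBC article body
print(example["summary"])         # journalist-written single-sentence summary
```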

Summary

Introduction

Current state-of-the-art conditional text generation models accomplish a high level of fluency and coherence, mostly thanks to advances in sequence-to-sequence architectures with attention and copy (Sutskever et al., 2014; Bahdanau et al., 2015; Gu et al., 2016), fully attention-based Transformer architectures (Vaswani et al., 2017; Dai et al., 2019) and, more recently, pretrained language modeling for natural language understanding (Devlin et al., 2019; Radford et al., 2018; Yang et al., 2019; Liu et al., 2019). Among the systems we study, pretrained models also have the highest percentage of extrinsic hallucinations that are factual. This suggests that while some studies argue that large-scale pretrained models are merely better at learning data-specific regularities (Niven and Kao, 2019), at least on in-domain summarization the gains in automatic metrics are realized in observable differences by humans. ROUGE (Lin and Hovy, 2003) and BERTScore (Zhang et al., 2020) correlate less with faithfulness/factuality than metrics derived from automatic semantic inference systems, i.e., the degree to which a summary is entailed by the source document. This presents an opportunity for improved automatic evaluation measures as well as model training and decoding objectives.
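
Since the entailment signal is the actionable takeaway here, the following is a minimal sketch of how such a faithfulness score could be computed with an off-the-shelf NLI model. The checkpoint name (`roberta-large-mnli`), the premise/hypothesis pairing, and the label ordering are assumptions for illustration; the paper's own entailment system may differ.

```python
# A minimal sketch of entailment-based faithfulness scoring, assuming an
# off-the-shelf MNLI checkpoint; this illustrates the idea above, not the
# authors' exact evaluation pipeline.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumption: any MNLI-style model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_score(document: str, summary: str) -> float:
    """Probability that the summary is entailed by the source document."""
    # Premise = source document, hypothesis = model-generated summary.
    inputs = tokenizer(document, summary, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

doc = "The BBC reported that the new rail line will open in 2025."
faithful = "A new rail line is due to open in 2025."
hallucinated = "The new rail line opened last year to record crowds."
print(entailment_score(doc, faithful))      # expected: high
print(entailment_score(doc, hallucinated))  # expected: low
```

A score like this could be compared directly against ROUGE or BERTScore when ranking candidate summaries, which is the kind of model-selection use the paper points toward.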

Hallucinations in Summarization
Intrinsic and Extrinsic Hallucinations
Factual Hallucinations in Summarization
Extreme Document Summarization
Abstractive Summaries
Experiments and Results
Automatic Evaluation of Summaries
Assessment of Hallucinations
Automatic Measures for Hallucinations
Model Selection with Entailment
Related Work
Conclusion
A Model Hyperparameters and Predictions
B Inter-annotator agreement
