Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark

Nouha Dziri, David Reitter, Hannah Rashkin, Tal Linzen

doi:10.1162/tacl_a_00506

Abstract

Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models’ responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset.


Journal: Transactions of the Association for Computational Linguistics
Publication Date: Sep 19, 2022
Citations: 8
License type: CC BY 4.0


