A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Daniel Deutsch, Rotem Dror, Dan Roth

doi:10.1162/tacl_a_00417

Open Access

https://doi.org/10.1162/tacl_a_00417

Abstract

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics' correlations reflects a true difference or is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistically significant improvements over ROUGE, two recent works, QAEval and BERTScore, do so in some evaluation settings.
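To make the two resampling ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes one metric score and one human score per summary, resamples those pairs directly for the bootstrap, and swaps standardized scores between two metrics for the permutation test. The function names, the choice of Pearson correlation, the standardization step, and the one-sided hypothesis are all assumptions made for illustration; the paper itself compares several resampling schemes.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def standardize(x):
    """Rescale scores to zero mean and unit variance (assumed preprocessing)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def bootstrap_ci(metric, human, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for corr(metric, human).

    Resamples (metric, human) pairs with replacement; this simple scheme
    is an illustrative assumption, not the paper's exact procedure.
    """
    rng = np.random.default_rng(seed)
    metric = np.asarray(metric, dtype=float)
    human = np.asarray(human, dtype=float)
    n = len(metric)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        m, h = metric[idx], human[idx]
        # Skip degenerate resamples for which the correlation is undefined.
        if m.std() == 0 or h.std() == 0:
            continue
        stats.append(pearson(m, h))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)


def permutation_test(metric_a, metric_b, human, n_permutations=2000, seed=0):
    """One-sided paired permutation test.

    Null hypothesis: metric A does not correlate better with the human
    scores than metric B. Returns an estimated p-value.
    """
    rng = np.random.default_rng(seed)
    a = standardize(metric_a)  # put both metrics on a comparable scale
    b = standardize(metric_b)
    human = np.asarray(human, dtype=float)
    observed = pearson(a, human) - pearson(b, human)
    count = 0
    for _ in range(n_permutations):
        # Independently swap each summary's two metric scores with prob. 0.5.
        swap = rng.random(len(a)) < 0.5
        perm_a = np.where(swap, b, a)
        perm_b = np.where(swap, a, b)
        delta = pearson(perm_a, human) - pearson(perm_b, human)
        if delta >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)
```

Under these assumptions, a wide interval from `bootstrap_ci` signals an imprecise correlation estimate, and a small value from `permutation_test` suggests that the gap between two metrics is unlikely to be due to chance alone.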

Journal: Transactions of the Association for Computational Linguistics
Publication Date: Oct 27, 2021
Citations: 7
License type: CC BY 4.0


