Abstract

One of the main challenges in the development of summarization tools is the evaluation of summarization quality. On the one hand, human assessment of summarization quality by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, automatic assessment metrics are reported not to correlate strongly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations for assessing the intrinsic and extrinsic quality of summarization. We compare crowd ratings with expert ratings and with automatic metrics such as ROUGE, BLEU, and BertScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation with respect to major influencing factors: the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the number of crowd workers needed to achieve results comparable to experts, in particular for quality dimensions such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.
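For illustration, the snippet below is a minimal sketch, not the authors' code, of how aggregated crowd ratings could be compared against expert ratings for a single quality dimension such as overall quality; the rating values are hypothetical placeholders, and the use of Spearman and Pearson correlation is an assumption on our part.

    # Minimal sketch: correlate aggregated crowd ratings with expert ratings
    # for one quality dimension (e.g. overall quality, OQ). All values below
    # are hypothetical placeholders, not data from the paper.
    from scipy.stats import pearsonr, spearmanr

    crowd_oq = [4.2, 3.8, 2.5, 4.9, 3.1]    # e.g. mean rating over crowd workers
    expert_oq = [4.0, 3.5, 2.0, 5.0, 3.0]   # expert ratings for the same summaries

    rho, p_rho = spearmanr(crowd_oq, expert_oq)  # rank correlation
    r, p_r = pearsonr(crowd_oq, expert_oq)       # linear correlation

    print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
    print(f"Pearson r    = {r:.2f} (p = {p_r:.3f})")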

Highlights

  • Even though there has been an enormous increase in automatic summarization research, human evaluation of summarization is still an understudied aspect

  • Results are presented for the overall quality score (OQ), the five intrinsic quality scores (grammaticality (GR), non-redundancy (NR), referential clarity (RC), focus (FO), and structure & coherence (SC)), and the three extrinsic quality scores (summary usefulness (SU), post usefulness (PU), and summary informativeness (SI))

  • We analyzed BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BertScore, and BLEURT by taking, for each metric, the mean of the scores calculated against the two expert summaries, and added the BLANC scores, resulting in 350 scores in total (50 summaries x 7 automatic metrics); a sketch of this multi-reference averaging follows below
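As a concrete illustration of the multi-reference averaging mentioned in the last highlight, the following sketch scores one system summary against two expert reference summaries with ROUGE and averages the two scores. The example texts are hypothetical, and this is not the paper's evaluation script; it uses the rouge-score package, which may differ from the authors' tooling.

    # Hedged sketch: mean ROUGE F1 over two expert reference summaries.
    # Requires: pip install rouge-score
    from statistics import mean
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

    def multi_ref_rouge(summary, references):
        """Average ROUGE F1 over all (here: two) expert references."""
        per_ref = [scorer.score(ref, summary) for ref in references]  # score(target, prediction)
        return {m: mean(s[m].fmeasure for s in per_ref)
                for m in ["rouge1", "rouge2", "rougeL"]}

    # Hypothetical inputs (not taken from the data set used in the paper)
    system_summary = "the post describes how crowd workers rated summary quality"
    expert_refs = [
        "the post explains how crowd workers judged the quality of summaries",
        "crowd workers rated the quality of automatically generated summaries",
    ]
    print(multi_ref_rouge(system_summary, expert_refs))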


Summary

Introduction

Even though there has been an enormous increase in automatic summarization research, human evaluation of summarization is still an understudied aspect. The authors of previous studies did not apply any pre-qualification test, did not report the number of crowd workers, did not apply annotation aggregation methods, or did not analyze the effect of the reading effort and readability of source texts caused by the texts' structural and formal composition. They used the TAC and CNN/Daily Mail data sets, which are derived from high-quality English texts. There is therefore a research gap regarding best practices for crowd-based evaluation of summarization, especially for languages other than English and for noisy internet data.
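To make the notion of annotation aggregation more tangible, here is a hedged sketch of two simple aggregation baselines for combining the ratings of several crowd workers per summary, the mean rating and the majority vote. The function names and values are hypothetical, and the aggregation methods examined in the paper may differ.

    # Hedged sketch of two simple aggregation baselines for per-summary crowd
    # ratings: mean rating and majority vote. All names and values are
    # hypothetical; the aggregation methods examined in the paper may differ.
    from collections import Counter
    from statistics import mean

    def aggregate_mean(ratings):
        """Mean rating over all crowd workers for one summary."""
        return mean(ratings)

    def aggregate_majority(ratings):
        """Most frequent rating (ties resolve to the value seen first)."""
        return Counter(ratings).most_common(1)[0][0]

    worker_ratings = [4, 5, 4, 2, 4]           # hypothetical ratings from 5 workers
    print(aggregate_mean(worker_ratings))      # -> 3.8
    print(aggregate_majority(worker_ratings))  # -> 4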

