Abstract

One of the main challenges in the development of summarization tools is the evaluation of summarization quality. On the one hand, human assessment of summarization quality by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, automatic assessment metrics are reported not to correlate strongly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations for assessing the intrinsic and extrinsic quality of summarization. We compare crowd ratings with expert ratings and with automatic metrics such as ROUGE, BLEU, and BertScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation with respect to major influencing factors: the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the number of crowd workers needed to achieve results comparable to experts, in particular for quality dimensions such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.
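For illustration, the snippet below is a minimal sketch, not the authors' code, of how aggregated crowd ratings could be compared against expert ratings for a single quality dimension such as overall quality; the rating values are hypothetical placeholders, and the use of Spearman and Pearson correlation is an assumption on our part.

    # Minimal sketch: correlate aggregated crowd ratings with expert ratings
    # for one quality dimension (e.g. overall quality, OQ). All values below
    # are hypothetical placeholders, not data from the paper.
    from scipy.stats import pearsonr, spearmanr

    crowd_oq = [4.2, 3.8, 2.5, 4.9, 3.1]    # e.g. mean rating over crowd workers
    expert_oq = [4.0, 3.5, 2.0, 5.0, 3.0]   # expert ratings for the same summaries

    rho, p_rho = spearmanr(crowd_oq, expert_oq)  # rank correlation
    r, p_r = pearsonr(crowd_oq, expert_oq)       # linear correlation

    print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
    print(f"Pearson r    = {r:.2f} (p = {p_r:.3f})")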

Highlights

  • Even though there has been an enormous increase in automatic summarization research, human evaluation of summarization is still an understudied aspect

  • Results are presented for the overall quality score (OQ), the five intrinsic quality scores (grammaticality (GR), non-redundancy (NR), referential clarity (RC), focus (FO), and structure & coherence (SC)), and the three extrinsic quality scores (summary usefulness (SU), post usefulness (PU), and summary informativeness (SI))

  • We analyzed BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BertScore, and BLEURT by taking, for each metric, the mean of the scores calculated against the two expert summaries, and added the BLANC scores, resulting in 350 scores in total (50 summaries x 7 automatic metrics); a sketch of this multi-reference averaging follows below
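As a concrete illustration of the multi-reference averaging mentioned in the last highlight, the following sketch scores one system summary against two expert reference summaries with ROUGE and averages the two scores. The example texts are hypothetical, and this is not the paper's evaluation script; it uses the rouge-score package, which may differ from the authors' tooling.

    # Hedged sketch: mean ROUGE F1 over two expert reference summaries.
    # Requires: pip install rouge-score
    from statistics import mean
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

    def multi_ref_rouge(summary, references):
        """Average ROUGE F1 over all (here: two) expert references."""
        per_ref = [scorer.score(ref, summary) for ref in references]  # score(target, prediction)
        return {m: mean(s[m].fmeasure for s in per_ref)
                for m in ["rouge1", "rouge2", "rougeL"]}

    # Hypothetical inputs (not taken from the data set used in the paper)
    system_summary = "the post describes how crowd workers rated summary quality"
    expert_refs = [
        "the post explains how crowd workers judged the quality of summaries",
        "crowd workers rated the quality of automatically generated summaries",
    ]
    print(multi_ref_rouge(system_summary, expert_refs))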


Summary

Introduction

Even though there has been an enormous increase in automatic summarization research, human evaluation of summarization is still an understudied aspect. The authors of previous studies did not apply any pre-qualification test, did not report the number of crowd workers, did not apply annotation aggregation methods, or did not analyze the effect of the reading effort and readability of source texts caused by the texts' structural and formal composition. They used the TAC and CNN/Daily Mail data sets, which are derived from high-quality English texts. There is therefore a research gap regarding best practices for crowd-based evaluation of summarization, especially for languages other than English and for noisy internet data.
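To make the notion of annotation aggregation more tangible, here is a hedged sketch of two simple aggregation baselines for combining the ratings of several crowd workers per summary, the mean rating and the majority vote. The function names and values are hypothetical, and the aggregation methods examined in the paper may differ.

    # Hedged sketch of two simple aggregation baselines for per-summary crowd
    # ratings: mean rating and majority vote. All names and values are
    # hypothetical; the aggregation methods examined in the paper may differ.
    from collections import Counter
    from statistics import mean

    def aggregate_mean(ratings):
        """Mean rating over all crowd workers for one summary."""
        return mean(ratings)

    def aggregate_majority(ratings):
        """Most frequent rating (ties resolve to the value seen first)."""
        return Counter(ratings).most_common(1)[0][0]

    worker_ratings = [4, 5, 4, 2, 4]           # hypothetical ratings from 5 workers
    print(aggregate_mean(worker_ratings))      # -> 3.8
    print(aggregate_majority(worker_ratings))  # -> 4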

