A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization

Mehwish Fatima, Michael Strube

doi:10.18653/v1/2021.newsum-1.5

Abstract

Cross-lingual summarization is a challenging task for which there are no cross-lingual scientific resources currently available. To overcome the lack of a high-quality resource, we present a new dataset for monolingual and cross-lingual summarization considering the English-German pair. We collect high-quality, real-world cross-lingual data from Spektrum der Wissenschaft, which publishes human-written German scientific summaries of English science articles on various subjects. The generated Spektrum dataset is small; therefore, we harvest a similar dataset from the Wikipedia Science Portal to complement it. The Wikipedia dataset consists of English and German articles, which can be used for monolingual and cross-lingual summarization. Furthermore, we present a quantitative analysis of the datasets and results of empirical experiments with several existing extractive and abstractive summarization models. The results suggest the viability and usefulness of the proposed dataset for monolingual and cross-lingual summarization.

Highlights

Summarization research has recently shifted from monolingual summarization (MS) to cross-lingual summarization (CLS)
This paper aims to address this issue by developing a summarization dataset containing scientific texts of the English-German language pair from two resources, Spektrum der Wissenschaft (SPEKTRUM) and the Wikipedia Science Portal (WSP)
We collect our primary dataset from SPEKTRUM, consisting of 1,510 English science articles with human-written German summaries

Summary

Introduction

Summarization research has recently shifted from monolingual summarization (MS) to cross-lingual summarization (CLS) [...] on existing monolingual news datasets and off-the-shelf machine translation (MT) systems, which may introduce noise into pseudo-cross-lingual summarization (PCLS) data. As these CLS studies rely on [...]

We collect our primary dataset from SPEKTRUM, consisting of 1,510 English science articles with human-written German summaries. The collection of data from two different resources ensures diversity in the written text and topics. It is worth noting that the WIKIPEDIA dataset can be used for MS, which distinguishes it from existing datasets.
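
To make the dual use of one collection concrete, the following minimal Python sketch shows how a single paired record could yield both a monolingual and a cross-lingual training pair. The record layout, the field names (`text_en`, `summary_en`, `summary_de`), the helper `build_pairs`, and the sample sentences are all hypothetical and chosen for illustration; they are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PairedArticle:
    """Hypothetical record layout; field names are illustrative assumptions."""
    text_en: str     # English article body
    summary_en: str  # English summary of the article
    summary_de: str  # German summary of the same article


def build_pairs(records: List[PairedArticle]) -> Dict[str, List[Tuple[str, str]]]:
    """Derive (source, target) pairs for MS and CLS from the same records."""
    return {
        # Monolingual summarization: English text -> English summary
        "ms_en": [(r.text_en, r.summary_en) for r in records],
        # Cross-lingual summarization: English text -> German summary
        "cls_en_de": [(r.text_en, r.summary_de) for r in records],
    }


# Made-up sample record, not real dataset content
records = [
    PairedArticle(
        text_en="Photosynthesis converts light energy into chemical energy ...",
        summary_en="Plants turn light into chemical energy.",
        summary_de="Pflanzen wandeln Licht in chemische Energie um.",
    )
]

pairs = build_pairs(records)
print(len(pairs["ms_en"]), len(pairs["cls_en_de"]))
```

Under this assumed layout, the source text is shared between the two tasks and only the target language changes, so no machine-translated pseudo-summaries are needed to form the cross-lingual pairs.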


Publication Date: Jan 1, 2021
Citations: 3
License type: cc-by

