Abstract
Cross-lingual summarization is a challenging task for which there are no cross-lingual scientific resources currently available. To overcome the lack of a high-quality resource, we present a new dataset for monolingual and cross-lingual summarization considering the English-German pair. We collect high-quality, real-world cross-lingual data from Spektrum der Wissenschaft, which publishes human-written German scientific summaries of English science articles on various subjects. The generated Spektrum dataset is small; therefore, we harvest a similar dataset from the Wikipedia Science Portal to complement it. The Wikipedia dataset consists of English and German articles, which can be used for monolingual and cross-lingual summarization. Furthermore, we present a quantitative analysis of the datasets and results of empirical experiments with several existing extractive and abstractive summarization models. The results suggest the viability and usefulness of the proposed dataset for monolingual and cross-lingual summarization.
Highlights
Summarization research has recently shifted from monolingual summarization (MS) to cross-lingual summarization (CLS)
This paper aims to address this issue by developing a summarization dataset containing scientific texts of the English-German language pair from two resources, Spektrum der Wissenschaft (SPEKTRUM) and the Wikipedia Science Portal (WSP)
We collect our primary dataset from SPEKTRUM, consisting of 1,510 English science articles with human-written German summaries
Summary
Summarization research has recently shifted from monolingual summarization (MS) to cross-lingual summarization (CLS). [...] on existing monolingual news datasets and off-the-shelf machine translation (MT) systems, which may introduce noise into pseudo-cross-lingual summarization (PCLS) data. As these CLS studies rely on [...]
We collect our primary dataset from SPEKTRUM, consisting of 1,510 English science articles with human-written German summaries. Collecting data from two different resources ensures diversity in the written text and topics. It is worth noting that the WIKIPEDIA dataset can be used for MS, which distinguishes it from existing datasets.