Abstract

Recently, the field of natural language processing (NLP) has grown rapidly, driven by massive datasets. At the same time, the need for automatic summarization systems has been increasing rapidly as the amount of textual information on the web and in large data centers has become intractable for human readers. However, the lack of large-scale, high-quality Chinese datasets remains a critical bottleneck for further research on automatic text summarization. To close this gap, we searched domestic and foreign Chinese news websites and designed FCFS (Format and Content Filtering System) to crawl and filter these records to construct NEWSFARM. NEWSFARM is a large-scale Chinese long news summarization corpus containing more than 220K Chinese long news articles paired with summaries written by professional editors or authors, all of which are released to the public. We calculated static metrics and designed extensive experiments with baseline models to evaluate the dataset. Compared with common datasets, the results not only demonstrate the usefulness and challenges of the proposed corpus for automatic text summarization but also validate the superiority of FCFS.
