Abstract

We present CLTS, a Chinese long text summarization dataset, to address the scarcity of large-scale, high-quality datasets in automatic summarization, which limits further research. To the best of our knowledge, it is the first long text summarization dataset in Chinese. Extracted from the Chinese news website ThePaper.cn (https://www.thepaper.cn/), the corpus contains more than 180,000 long Chinese articles with corresponding summaries written by professional editors and authors. The dataset is available for download at https://github.com/lxj5957/CLTS-Dataset. We train and evaluate several existing methods on CLTS to verify the utility and challenges of the dataset; the results show that the corpus is useful for establishing baselines and can support further research on automatic text summarization.
