Abstract

Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. For Vietnamese, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, each cluster contains only three documents, which is small compared to typical English datasets. This motivates us to construct ViMs, a large, high-quality Vietnamese dataset for abstractive multi-document summarization. To that end, we recruited 29 annotators and enhanced MDSWriter, an open-source annotation tool, to support the annotators in creating gold-standard summaries. As a result, ViMs contains 600 summaries corresponding to 300 clusters of 1,945 documents. We have verified the reliability of our dataset using a variety of metrics, including conventional Cohen's $\kappa$, relaxed Cohen's $\kappa$ (a new metric that we propose to make agreement measurement more suitable for abstractive summarization), and ROUGE scores. A relaxed $\kappa$ score of 0.55 indicates that ViMs attains moderate agreement between annotators. Meanwhile, ROUGE scores are 0.729 for ROUGE-1, 0.507 for ROUGE-2, and 0.524 for ROUGE-SU4. We have further evaluated ViMs using three different summarization systems: TextRank, CFVi, and MUSEEC, whose ROUGE-1 scores are 0.628, 0.711, and 0.732, respectively. These results show that the ViMs dataset is suitable for both training and evaluating multi-document summarization systems. We have made the dataset and evaluation results publicly available for the research community. Note that unlike previous work, which only published the final summarization dataset, we also publish intermediate annotation results, which can be used in other NLP problems such as sentence classification.
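For reference, the conventional Cohen's $\kappa$ used in the inter-annotator agreement analysis follows the standard two-rater formulation sketched below; the relaxed variant is our own proposal and is defined in the paper, not reproduced here.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$

where $p_o$ is the observed agreement between the two annotators and $p_e$ is the agreement expected by chance; $\kappa$ values around 0.41–0.60 are commonly interpreted as moderate agreement, consistent with the reported relaxed $\kappa$ of 0.55.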
