Abstract

A standard benchmark collection is essential to the reproducibility of any research. Several early works in text summarisation suffered from the lack of standard evaluation corpora at the time [1, 8]. The advent of conferences such as the Document Understanding Conference (DUC) [2] and the Text Analysis Conference (TAC) [18] solved that problem. These conferences produced standard evaluation benchmarks for text summarisation and, as a result, made streamlined efforts possible. Today the benchmark collections of documents and associated manually written summaries provided by DUC and TAC are by far the most widely used collections for text summarisation. They have become essential for reproducibility as well as for comparing performance across systems. However, with the many data-driven approaches proposed in the last few years, the DUC and TAC collections, with their hundreds of article-summary pairs, are no longer sufficient. A few other corpora, such as the Gigaword corpus and the CNN/Dailymail corpus [21], contain millions of document-summary pairs. But these corpora are not publicly available and hence are of limited use. Moreover, both of these corpora, as well as DUC and TAC, consist only of newswire text. TAC did later introduce a task on biomedical article summarisation, which we discuss later in this chapter. Overall, however, there are few domain-specific corpora that are both large enough to benefit data-driven approaches and publicly available. In this work we propose two new corpora for domain-specific summarisation in the legal and scientific domains. The legal corpus consists of judgements delivered by the Supreme Court of India and their associated summaries, which are written by legal experts. The corpus of scientific articles consists of research papers from the ACL Anthology, a publicly available repository of research papers from computational linguistics and related domains.
In this chapter we briefly discuss the DUC and TAC corpora as well as the corpora developed as a part of this work. We also provide an overview of the various strategies that are used to evaluate summarisation systems.
