Abstract

Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes joint usage of the annotations difficult, preventing researchers from searching the corpora in a unified way or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for mapping the relational labels of different frameworks to each other, but these proposals have so far not been validated against existing annotations. The two largest resources annotated with discourse relations, the Penn Discourse Treebank and the Rhetorical Structure Theory Discourse Treebank, have, however, been annotated on the same texts, allowing for a direct comparison of the annotation layers. We propose a method for automatically aligning the discourse segments, and then evaluate existing mapping proposals by comparing the empirically observed mappings against the proposed ones. Our analysis highlights the influence of segmentation on subsequent discourse relation labelling, and shows that while agreement between frameworks is reasonable for explicit relations, agreement on implicit relations is low. We identify several sources of systematic discrepancies between the two annotation schemes and discuss consequences for future annotation and for usage of the existing resources.

Highlights

  • We describe the notions underlying the two discourse relation annotation frameworks, and their corresponding corpora, that are mapped in this article, namely the Penn Discourse Treebank (PDTB) 2.0 and the Rhetorical Structure Theory Discourse Treebank (RST-DT)

  • The remaining 48% (2489 instances) of the data included in the mapping analysis consists of relations for which the Rhetorical Structure Theory Discourse Treebank (RST-DT) tree is more complex than the PDTB relation

  • The connective "while" is an example of a case where labels corresponded well to one another, i.e., where the null hypothesis that the labels are independent could be rejected with high confidence: we find that annotators could reliably distinguish between the TEMPORAL.SYNCHRONOUS and CONTRAST / COMPARISON readings of "while"; the annotations from the two frameworks almost always agreed on the reading (p < 0.0001)
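The independence test mentioned in the last highlight can be illustrated with a small sketch. The source does not say which test was used; a Pearson chi-square test of independence is assumed here as one common choice. The counts are invented for illustration and are not taken from the PDTB or the RST-DT, and the names chi_square_statistic and toy_counts are hypothetical.

```python
# Toy illustration only: the counts below are invented, not corpus data.
# Assumption: a Pearson chi-square test of independence; the source does
# not state which test produced the reported p-value.

def chi_square_statistic(table):
    """Return the Pearson chi-square statistic for a contingency table.

    `table` is a list of rows; each cell holds an observed count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column labels were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: hypothetical label in one framework (temporal, contrastive).
# Columns: hypothetical label in the other framework for the same instances.
toy_counts = [[40, 5],
              [4, 51]]

stat = chi_square_statistic(toy_counts)
dof = (len(toy_counts) - 1) * (len(toy_counts[0]) - 1)

# Chi-square critical value for p = 0.0001 with 1 degree of freedom.
CRITICAL_DF1_P0001 = 15.137

print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
print("reject independence at p < 0.0001:", stat > CRITICAL_DF1_P0001)
```

When most instances fall on the diagonal of the table, as in these toy counts, the statistic is far above the critical value, which is the pattern the highlight describes for "while".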


Summary

Introduction

We describe the notions underlying the two discourse relation annotation frameworks, and their corresponding corpora, that are mapped in this article, namely the PDTB 2.0 and the RST-DT. There are different implementations of RST annotation, including, for example, the Basque RST TreeBank (Iruskieta et al., 2013), the Potsdam Commentary Corpus (Stede & Neumann, 2014b), and the CSTNews Corpus (Cardoso et al., 2011). These corpora all follow the overall style of RST annotation, but may differ in how exactly they define discourse segments, which set of relation labels they use, and how nuclearity is interpreted or operationalized.
