Abstract

Discourse parsing, which involves understanding the structure and information flow of a text and modeling its coherence, is an important task in natural language processing. It forms the basis of several downstream tasks such as question answering, text summarization, and sentiment analysis. Discourse unit segmentation is one of the fundamental tasks in discourse parsing and refers to identifying the elementary units that combine to form a coherent text. In this paper, we present a transformer-based approach to the automated identification of discourse unit segments and connectives. Early approaches to segmentation relied on rule-based systems that used POS tags and other syntactic information to identify discourse segments. Recently, transformer-based neural systems have shown promising results in this domain. Our system, SegFormers, employs this transformer-based approach to perform multilingual discourse segmentation and connective identification across 16 datasets encompassing 11 languages and 3 different annotation frameworks. We evaluate the system using F1 scores for both tasks; the best system achieves its highest F1 score, 97.02%, on the treebanked English RST-DT dataset.

Highlights

  • In the Penn Discourse TreeBank (PDTB) framework, the segmentation task corresponds to identifying the spans of discourse connectives that explicitly identify the presence of a discourse relation

  • The PDTB framework consists of labels that mark the entire span of discourse connectives that explicitly identify the existence of a discourse relation

  • The final precision is considerably higher than the recall (Basque dataset: 95% precision and 61% recall; Russian dataset: 84% precision and 60% recall), indicating that the model primarily targets the generic discourse unit boundaries at the beginning of discourse segments
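To make the precision/recall gap in the last highlight concrete, the following is a minimal sketch of how boundary-level precision, recall, and F1 can be computed. It assumes segmentation is scored as binary per-token labels (1 = a discourse unit begins at this token); the function names and the toy label sequences are illustrative only and are not taken from the paper or the shared task's scorer.

```python
def boundary_prf(gold, pred):
    """Precision, recall and F1 for the positive (boundary) class.

    gold, pred: equal-length sequences of 0/1 labels, one per token,
    where 1 marks a token that starts a discourse unit (an assumption
    made for this sketch).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example: a conservative model that only predicts "easy" boundaries.
gold = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
p, r, f = boundary_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Harmonic mean of the rounded Basque figures quoted above (95% / 61%).
# This is plain arithmetic on those rounded numbers, not a reported score.
print(f"f1 from 0.95/0.61: {f1_from(0.95, 0.61):.3f}")
```

In the toy example every predicted boundary is correct but half of the gold boundaries are missed, which is the same high-precision, low-recall pattern described for the Basque and Russian datasets. Because F1 is a harmonic mean, it is pulled toward the lower of the two values.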


Summary

Datasets

We describe the datasets provided by the organizers of the CODI-DISRPT2021: Discourse Relation Parsing and Treebanking Shared Task at EMNLP 2021. The data consists of 16 datasets covering 11 languages (German, English, Basque, Persian, French, Dutch, Portuguese, Russian, Spanish, Turkish, and Mandarin Chinese). This is the first time the Persian RST corpus (Shahmohammadi et al., 2021) has been included for the task of discourse segmentation. The Chinese PDTB dataset (Zhou and Xue, 2015) is not freely available. The organizers provided the scores on this dataset after running the model on the CDTB dataset during the evaluation phase.

Annotation frameworks
Languages
System Overview
Bidirectional LSTM
SegFormers
Results
Conclusion and Future Work