Abstract

To alleviate the shortage of multi-domain data and to capture discourse phenomena for task-oriented dialogue modeling, we propose RiSAWOZ, a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning 12 domains, which is larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35% of the dataset, respectively. Each dialogue is labeled with comprehensive dialogue annotations, including the dialogue goal in the form of a natural language description, the domain, and dialogue states and acts on both the user and system sides. In addition to traditional dialogue annotations, we provide linguistic annotations on discourse phenomena in dialogues, e.g., ellipsis and coreference, which are useful for dialogue coreference and ellipsis resolution tasks. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, along with statistics and analysis of the dataset. A series of benchmark models and results are reported, covering natural language understanding (intent detection and slot filling), dialogue state tracking and dialogue context-to-text generation, as well as coreference and ellipsis resolution, which facilitate baseline comparison for future research on this corpus.

Highlights

  • In recent years, a variety of datasets tailored for task-oriented dialogue have been constructed, such as MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2019a) and CrossWOZ (Zhu et al., 2020), along with increasing interest in conversational AI in both academia and industry (Gao et al., 2018)

  • In order to facilitate such task reformulation, we provide the second type of linguistic annotation on RiSAWOZ: utterance rewriting for ellipsis and coreference resolution

Summary

Introduction

In recent years, a variety of datasets tailored for task-oriented dialogue have been constructed, such as MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2019a) and CrossWOZ (Zhu et al., 2020). MultiWOZ (Budzianowski et al., 2018), probably the most promising and notable dialogue corpus recently collected in a Wizard-of-Oz (i.e., human-to-human) way, is one order of magnitude larger than the aforementioned corpora collected in the same way. However, it contains noisy system-side state annotations and lacks user-side dialogue acts (Eric et al., 2019; Zhu et al., 2020). To alleviate these issues, we propose RiSAWOZ, a large-scale Chinese multi-domain Wizard-of-Oz task-oriented dialogue dataset with rich semantic annotations. To our knowledge, RiSAWOZ is to date the largest fully annotated human-to-human task-oriented dialogue dataset. It contains 11,200 multi-turn dialogues with more than 150K utterances ranging over 12 domains, namely Attraction, Restaurant, Hotel, Flight, Train, Weather, Movie, TV, Computer, Car, Hospital and Education (after-school remedial courses), several of which are not covered in previous datasets. The dataset and the benchmark models will be publicly available soon.

Related Work
Database and Ontology Construction
Dialogue Goal
Dialogue Collection and Annotation
User Side
System Side
Coreference Clusters Annotation
Ellipsis and Coreference Annotation via Utterance Rewriting
Our Dialogue Dataset
Data Structure
Data Statistics
RiSAWOZ as a New Benchmark
Natural Language Understanding
Dialogue State Tracking
Dialogue Context-to-Text Generation
Results
Coreference Resolution
Unified Generative Ellipsis and Coreference Resolution
Other Tasks
Conclusion