Abstract
Medical dialogue systems show promise for assisting telemedicine: they can increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build large-scale medical dialogue datasets -- MedDialog -- which contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, and 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, and 44.53 million tokens, covering 96 specialties of diseases. To the best of our knowledge, MedDialog is the largest medical dialogue dataset to date. We pretrain several dialogue generation models on the Chinese MedDialog dataset, including Transformer, GPT, and BERT-GPT, and compare their performance. Models trained on MedDialog are able to generate clinically correct and doctor-like medical dialogues. We also study the transferability of models trained on MedDialog to low-resource medical dialogue generation tasks. Both human evaluation and automatic evaluation show that finetuning models pretrained on MedDialog greatly improves performance on medical dialogue generation tasks with small datasets. The datasets and code are available at https://github.com/UCSD-AI4H/Medical-Dialogue-System
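The abstract describes a transfer-learning recipe: pretrain a dialogue model on MedDialog, then finetune it on a small target dataset. The following is a minimal sketch of that recipe, not the authors' released training code; it assumes a Hugging Face GPT-2 checkpoint as a stand-in for a MedDialog-pretrained model, with illustrative hyperparameters and toy data.

```python
# Hypothetical sketch of finetuning a pretrained dialogue model on a small
# target dataset. "gpt2" is an English stand-in; the paper's models are
# pretrained on the Chinese MedDialog corpus.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def encode(pairs, max_len=128):
    """Join each (source, target) pair into one training sequence."""
    texts = [s + tokenizer.eos_token + t + tokenizer.eos_token for s, t in pairs]
    return tokenizer(texts, truncation=True, max_length=max_len,
                     padding="max_length", return_tensors="pt")

# Toy low-resource dataset; a real one would hold a few hundred dialogues.
small_dataset = [
    ("Patient: I have had a sore throat for a week.",
     "Doctor: Any fever or difficulty swallowing?"),
]
batch = encode(small_dataset)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding in the LM loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few epochs are typical for small finetuning sets
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```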
Highlights
Telemedicine refers to the practice of delivering patient care remotely, where doctors provide medical consultations to patients using HIPAA-compliant video-conferencing tools
Through human evaluation and automatic evaluation, we show that models pretrained on MedDialog-CN can, via transfer learning, significantly improve performance on medical dialogue generation tasks where the dataset is small
Given a dialogue containing a sequence of alternating utterances between patient and doctor, we process it into a set of pairs $\{(s_i, t_i)\}$, where the target $t_i$ is a response from the doctor and the source $s_i$ is the concatenation of all utterances before $t_i$
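As a concrete illustration of this preprocessing step, here is a minimal Python sketch; the list-of-(speaker, utterance) input format and the `dialogue_to_pairs` helper are illustrative assumptions, not the dataset's actual schema.

```python
def dialogue_to_pairs(turns):
    """Convert a dialogue, given as (speaker, utterance) turns, into
    (source, target) training pairs: each doctor utterance t_i becomes a
    target, and the concatenation of all preceding utterances is its
    source s_i."""
    pairs = []
    for i, (speaker, utterance) in enumerate(turns):
        if speaker == "doctor" and i > 0:
            source = " ".join(u for _, u in turns[:i])
            pairs.append((source, utterance))
    return pairs

turns = [
    ("patient", "I have had a dry cough for three days."),
    ("doctor", "Do you have a fever?"),
    ("patient", "Yes, 38.2 degrees since last night."),
    ("doctor", "Please get a blood test and a chest X-ray."),
]
# Produces two (s_i, t_i) pairs, one per doctor response.
print(dialogue_to_pairs(turns))
```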
Summary
Telemedicine refers to the practice of delivering patient care remotely, where doctors provide medical consultations to patients using HIPAA-compliant video-conferencing tools. To address the limitations of existing datasets, we build large-scale medical dialogue datasets -- MedDialog -- that contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, and 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, and 44.53 million tokens, covering 96 specialties of diseases. We pretrain several dialogue generation models on the Chinese MedDialog dataset, including Transformer, BERT-GPT, and GPT, and compare their performance using automatic metrics. Through human evaluation and automatic evaluation, we show that models pretrained on MedDialog-CN can, via transfer learning, significantly improve performance on medical dialogue generation tasks where the dataset is small. The reported dataset statistics include the number of dialogues, utterances, and tokens, as well as the average, maximum, and minimum number of utterances per dialogue and of tokens per utterance.
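The table values themselves are not recoverable from this summary, but the listed statistics are straightforward to compute. Below is a hedged sketch, assuming dialogues are stored as lists of utterance strings and using whitespace tokenization as a simplification of the paper's tokenizer.

```python
from statistics import mean

def dataset_stats(dialogues):
    """Compute the per-dialogue and per-utterance statistics listed above.
    `dialogues` is assumed to be a list of dialogues, each a list of
    utterance strings; whitespace tokenization is an approximation."""
    utterances = [u for d in dialogues for u in d]
    utt_per_dialogue = [len(d) for d in dialogues]
    tokens_per_utt = [len(u.split()) for u in utterances]
    return {
        "# dialogues": len(dialogues),
        "# utterances": len(utterances),
        "# tokens": sum(tokens_per_utt),
        "avg/max/min utterances per dialogue":
            (mean(utt_per_dialogue), max(utt_per_dialogue), min(utt_per_dialogue)),
        "avg/max/min tokens per utterance":
            (mean(tokens_per_utt), max(tokens_per_utt), min(tokens_per_utt)),
    }
```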