ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application

Hetong Ma, Min Dai, Qing Qian, An Fang, Jie He, Ni Li, Jiansong Ren, Xuwen Wang, Feihong Yang, Jiao Li

doi:10.1186/s12911-020-1116-1

Abstract

Background: Rising global cancer incidence has a serious health impact in countries worldwide. Knowledge-powered health systems in different languages would enhance clinicians’ healthcare practice, patients’ health management, and public health literacy. A high-quality corpus of cancer information is a necessary foundation for cancer education. Large volumes of unstructured information exist in clinical narratives, electronic health records (EHR), and similar sources, but they can be used to train AI models only after being transformed into a structured corpus. The scarcity of multilingual cancer corpora limits intelligent processing such as machine translation in medical scenarios. We therefore created a cancer-specific cross-lingual corpus and opened it to the public for academic use.

Methods: To build an English-Chinese cancer parallel corpus, we developed a seven-step workflow: data retrieval, data parsing, data processing, corpus implementation, assessment verification, corpus release, and application. We applied the workflow to PDQ (Physician Data Query), a cross-lingual, comprehensive, and authoritative cancer information resource. We constructed, validated, and released the parallel corpus, named ECCParaCorp, and made it openly accessible online.

Results: The English-Chinese Cancer Parallel Corpus (ECCParaCorp) consists of 6685 aligned text pairs in XML, Excel, and CSV formats: 5190 sentence pairs, 1083 phrase pairs, and 412 word pairs. It covers six cancers (breast, liver, lung, esophageal, colorectal, and stomach cancer) and three cancer themes (prevention, screening, and treatment). All data in the parallel corpus are online and available for users to browse and download (http://www.phoc.org.cn/ECCParaCorp/).

Conclusions: ECCParaCorp is an openly accessible cross-lingual parallel corpus focused on cancer. It would help redress the scarcity of multilingual corpus resources, bridge the gap between human-readable information and machine-understandable data, and serve as a preparatory data foundation for intelligent technology applications such as cancer-related machine translation, cancer system development for medical education, and disease-oriented knowledge extraction.
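As a rough illustration of the corpus composition reported in the Results, the sketch below loads aligned pairs from CSV text and tallies them by granularity. The column names (`level`, `english`, `chinese`), the `AlignedPair` type, and the sample rows are assumptions made for this example; they are not the actual ECCParaCorp schema or data.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    level: str    # hypothetical label: "sentence", "phrase", or "word"
    english: str
    chinese: str


def load_pairs(handle):
    """Read aligned pairs from a CSV file object that has a header row."""
    reader = csv.DictReader(handle)
    return [AlignedPair(row["level"], row["english"], row["chinese"]) for row in reader]


def count_by_level(pairs):
    """Count how many pairs exist at each alignment granularity."""
    return Counter(pair.level for pair in pairs)


# Made-up sample rows for illustration only; not taken from ECCParaCorp.
SAMPLE_CSV = """level,english,chinese
word,cancer,癌症
phrase,breast cancer screening,乳腺癌筛查
sentence,Screening can find cancer early.,筛查可以及早发现癌症。
"""

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
counts = count_by_level(pairs)

# Pair counts reported in the abstract; they sum to the stated total.
reported = {"sentence": 5190, "phrase": 1083, "word": 412}
assert sum(reported.values()) == 6685
```

For a downloaded CSV file, the same `load_pairs` function could be given a file opened with `open(path, newline="", encoding="utf-8")`, after adjusting the column names to whatever the released file actually uses.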

Highlights

Rising global cancer incidence has a serious health impact in countries worldwide.
ECCParaCorp is an openly accessible cross-lingual parallel corpus focused on cancer.
It would help redress the scarcity of multilingual corpus resources, bridge the gap between human-readable information and machine-understandable data, and serve as a preparatory data foundation for intelligent technology applications such as cancer-related machine translation, cancer system development for medical education, and disease-oriented knowledge extraction.

Summary

Introduction

Rising global cancer incidence has a serious health impact in countries worldwide. A high-quality corpus of cancer information is a necessary foundation for cancer education. Large volumes of unstructured information exist in clinical narratives, electronic health records (EHR), and similar sources, but they can be used to train AI models only after being transformed into a structured corpus. Applying NLP methods to health, and to cancer in particular, could both improve physicians’ perspectives and share cutting-edge scientific results with the public, which makes it a promising direction for health research. The data foundation for a specific theme must be prepared first, so it is important to collect the most up-to-date research and information recorded in various languages.

Journal: BMC Medical Informatics and Decision Making
Publication Date: Jul 1, 2020
Citations: 5
License type: open-access

Similar Papers

English-Chinese Machine Translation Based on Transfer Learning and Chinese-English Corpus
Bo Xu
Computational Intelligence and Neuroscience | VOL. 2022
27 Sep 2022

Automatic Construction of Web-Based English/Chinese Parallel Corpora
Bin Tan ... Xu-Yan Zhou
01 Apr 2010

Chinese temporal relation resolution based on Chinese-English parallel corpus
Huilin Wang ... Yanqing He
International Journal of Embedded Systems | VOL. 9
01 Jan 2017

NCI's Physician Data Query (PDQ®) Cancer Information Summaries: History, Editorial Processes, Influence, and Reach
Richard E Manrow ... Margaret Beckwith
Journal of Cancer Education | VOL. 29
01 Sep 2013
