Phrase Pairs Research Articles

BackgroundThe increasing global cancer incidence corresponds to serious health impact in countries worldwide. Knowledge-powered health system in different languages would enhance clinicians’ healthcare practice, patients’ health management and public health literacy. High-quality corpus containing cancer information is the necessary foundation of cancer education. Massive non-structural information resources exist in clinical narratives, electronic health records (EHR) etc. They can only be used for training AI models after being transformed into structured corpus. However, the scarcity of multilingual cancer corpus limits the intelligent processing, such as machine translation in medical scenarios. Thus, we created the cancer specific cross-lingual corpus and open it to the public for academic use.MethodsAiming to build an English-Chinese cancer parallel corpus, we developed a workflow of seven steps including data retrieval, data parsing, data processing, corpus implementation, assessment verification, corpus release, and application. We applied the workflow to a cross-lingual, comprehensive and authoritative cancer information resource, PDQ (Physician Data Query). We constructed, validated and released the parallel corpus named as ECCParaCorp, made it openly accessible online.ResultsThe proposed English-Chinese Cancer Parallel Corpus (ECCParaCorp) consists of 6685 aligned text pairs in Xml, Excel, Csv format, containing 5190 sentence pairs, 1083 phrase pairs and 412 word pairs, which involved information of 6 cancers including breast cancer, liver cancer, lung cancer, esophageal cancer, colorectal cancer, and stomach cancer, and 3 cancer themes containing cancer prevention, screening, and treatment. All data in the parallel corpus are online, available for users to browse and download (http://www.phoc.org.cn/ECCParaCorp/).ConclusionsECCParaCorp is a parallel corpus focused on cancer in a cross-lingual form, which is openly accessible. It would make up the imbalance of scarce multilingual corpus resources, bridge the gap between human readable information and machine understanding data resources, and would contribute to intelligent technology application as a preparatory data foundation e.g. cancer-related machine translation, cancer system development towards medical education, and disease-oriented knowledge extraction.

Read full abstract

AbstractNeural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires high-quality large-scale parallel corpus, and it is not always possible to have sufficiently large corpus as it requires time, money, and professionals. Hence, many existing large-scale parallel corpus are limited to the specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in low-resource scenario without using any additional data. Our approach aims at augmenting the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English, Hindi–Bengali datasets for Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show the improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy points over the baseline systems. Experiments show that transformer models perform better than GRU models in low-resource scenarios. In addition to that, we also find that our proposed method outperforms SMT—which is known to work better than the neural models in low-resource scenarios—for some translation directions. In order to further show the effectiveness of our proposed model, we also employ our approach to another interesting NMT task, for example, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.

Read full abstract

Phrase Pairs Research Articles

Related Topics

Articles published on Phrase Pairs

Bilingual phrase induction with local hard negative sampling

Age effects on prosodic boundary perception.

Lexico-syntactic analysis of antonyms in some speeches of John Dramani Mahama

PhrasIS: Phrase Inference and Similarity benchmark

Addressing data scarcity issue for English–Mizo neural machine translation using data augmentation and language model

Exploration and Research on Chinese Culture in University English Translation Teaching in the Background of Internet+

CORRECTION OF TINNITUS AND NEUROPSYCHOLOGICAL STATUS OF PATIENTS WITH CEREBRAL ATHEROSCLEROSIS AND ESSENTIAL HYPERTENSION USING CITICOLINE AS PART OF COMPREHENSIVE THERAPY

ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application

Neural machine translation of low-resource languages using SMT phrase pair injection

Distributed Representation of Words in Cause and Effect Spaces

Syntactic Matching Methods in Pivot Translation

A unified framework and models for integrating translation memory into phrase-based statistical machine translation

Research on the Information Processing Model of Source Language of Parallel Corpus

Integrating Shallow Syntactic Labels in the Phrase-Boundary Translation Model

Phrase Table Induction Using Monolingual Data for Low-Resource Statistical Machine Translation

Minimum Bayes-Risk Phrase Table Pruning for Pivot-Based Machine Translation in Internet of Things

Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation

Extracting parallel phrases from comparable data for machine translation

Integrating Rules and Dictionaries from Shallow-Transfer Machine Translation into Phrase-Based Statistical Machine Translation

Extraction and Presentation of Bilingual Correspondences from Slovak-Bulgarian Parallel Corpus

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Phrase Pairs Research Articles

Related Topics

Articles published on Phrase Pairs

Bilingual phrase induction with local hard negative sampling

Age effects on prosodic boundary perception.

Lexico-syntactic analysis of antonyms in some speeches of John Dramani Mahama

PhrasIS: Phrase Inference and Similarity benchmark

Addressing data scarcity issue for English–Mizo neural machine translation using data augmentation and language model

Exploration and Research on Chinese Culture in University English Translation Teaching in the Background of Internet+

CORRECTION OF TINNITUS AND NEUROPSYCHOLOGICAL STATUS OF PATIENTS WITH CEREBRAL ATHEROSCLEROSIS AND ESSENTIAL HYPERTENSION USING CITICOLINE AS PART OF COMPREHENSIVE THERAPY

ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application

Neural machine translation of low-resource languages using SMT phrase pair injection

Distributed Representation of Words in Cause and Effect Spaces

Syntactic Matching Methods in Pivot Translation

A unified framework and models for integrating translation memory into phrase-based statistical machine translation

Research on the Information Processing Model of Source Language of Parallel Corpus

Integrating Shallow Syntactic Labels in the Phrase-Boundary Translation Model

Phrase Table Induction Using Monolingual Data for Low-Resource Statistical Machine Translation

Minimum Bayes-Risk Phrase Table Pruning for Pivot-Based Machine Translation in Internet of Things

Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation

Extracting parallel phrases from comparable data for machine translation

Integrating Rules and Dictionaries from Shallow-Transfer Machine Translation into Phrase-Based Statistical Machine Translation

Extraction and Presentation of Bilingual Correspondences from Slovak-Bulgarian Parallel Corpus