TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies

Kak Soky, Chenchen Ding, Chenhui Chu, Tatsuya Kawahara, Sheng Li, Masato Mimura, Sethserey Sam

doi:10.1142/s2717554522500072

Abstract

This paper extends our earlier work on the trilingual spoken language translation corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC), named TriECCC. TriECCC is a simultaneous spoken language translation corpus with parallel speech and text resources in three languages: Khmer, English, and French. The corpus contains approximately [Formula: see text] thousand utterances, with approximately [Formula: see text], [Formula: see text], and [Formula: see text] hours of speech and [Formula: see text], [Formula: see text], and [Formula: see text] million words of text in Khmer, English, and French, respectively. We first report baseline results for machine translation (MT) and speech translation (ST) systems, which show reasonable performance. We then investigate two ways to enhance the Khmer MT systems: combining multiple MT outputs with the ROVER method, and fine-tuning pre-trained English–French MT models. Experimental results show that ROVER is effective for combining the English-to-Khmer and French-to-Khmer systems. Fine-tuning from both single and multiple parent models improves BLEU scores for the Khmer-to-English/French and English/French-to-Khmer MT systems.
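To make the idea of combining several MT outputs concrete, the following is a minimal sketch of ROVER-style word-level voting. It is not the authors' implementation: the function names (`align_to_base`, `rover_vote`), the use of Python's `difflib` for alignment, whitespace tokenization, and the example sentences are all assumptions made for illustration. Unlike full ROVER, which builds a word transition network by iterative dynamic-programming alignment, this simplified version aligns every hypothesis to the first one and ignores words inserted relative to it.

```python
from collections import Counter
from difflib import SequenceMatcher

NULL = ""  # marks "no word" in a voting slot


def align_to_base(base, hyp):
    """For each base position, return the hyp word aligned to it, or NULL."""
    slots = [NULL] * len(base)
    matcher = SequenceMatcher(None, base, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair words position by position within the aligned span.
            for k in range(min(i2 - i1, j2 - j1)):
                slots[i1 + k] = hyp[j1 + k]
        # "delete": base words with no counterpart keep NULL.
        # "insert": extra hyp words are dropped (a simplification).
    return slots


def rover_vote(hypotheses):
    """Combine whitespace-tokenized hypotheses by majority vote per slot."""
    base = hypotheses[0].split()
    columns = [[word] for word in base]
    for hyp in hypotheses[1:]:
        for column, word in zip(columns, align_to_base(base, hyp.split())):
            column.append(word)
    output = []
    for column in columns:
        # Ties go to the earliest-seen candidate, i.e. the base hypothesis.
        word, _ = Counter(column).most_common(1)[0]
        if word != NULL:
            output.append(word)
    return " ".join(output)


# Hypothetical outputs from three systems for the same source sentence.
outputs = [
    "the court hears the witness",
    "the court heard the witness",
    "the court heard a witness",
]
print(rover_vote(outputs))  # the court heard the witness
```

Voting is by simple frequency here; ROVER can also weight votes by confidence scores. For Khmer output, a word segmenter would have to replace the whitespace split, since Khmer text is not written with spaces between words.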

Journal: International Journal of Asian Language Processing
Publication Date: Sep 1, 2021
Citations: 2
License type: other-oa

Similar Papers

Using a mixture of N-best lists from multiple MT systems in rank-sum-based confidence measure for MT outputs
Yasuhiro Akiba ... Seiichi Yamamoto
01 Jan 2004

Khmer Speech Translation Corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC)
Kak Soky ... Masato Mimura
18 Nov 2021

Combining Machine Translated Sentence Chunks from Multiple MT Systems
Matīss Rikters ... Inguna Skadiņa
01 Jan 2018

An Evaluation of the Accuracy of the Machine Translation Systems of Social Media Language
Yasser Muhammad Naguib Sabtan ...
International Journal of Advanced Computer Science and Applications | VOL. 12
01 Jan 2020
