Machine-Translation-Assisted Chinese-Mongolian and Uygur-Chinese Speech Translation Dataset Subsets

Ning Li

doi:10.11922/11-6035.csd.2021.0105.zh

Abstract

Few public datasets exist for speech translation, especially between Chinese and low-resource languages, and this scarcity limits the development of end-to-end speech translation. Following the approach used to build international speech translation datasets, we converted two public speech recognition datasets (AISHELL and THUYG-20) into speech translation datasets using machine translation. After data processing, the results were reviewed and verified by experts to obtain high-quality speech translation data. The dataset comprises a Chinese-Mongolian speech translation subset and a Uygur-Chinese speech translation subset, with audio sampled at 16 kHz. The Chinese-Mongolian subset contains 1,919 samples (238 MB), and the Uygur-Chinese subset contains 3,692 samples (652 MB). The dataset can be used for research on end-to-end speech translation and provides data support for exploring speech translation between Chinese and minority languages. Because it has been reviewed and verified by experts, the dataset can also be combined with speech recognition datasets to study machine translation.
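The conversion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `SpeechTranslationItem` record, the `build_st_items` helper, the manifest layout, and the `toy_translate` placeholder are all hypothetical names introduced here, and a real machine translation system and expert review step would replace the placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpeechTranslationItem:
    # Hypothetical record layout; the published dataset may be organized differently.
    audio_path: str         # path to an audio clip from the ASR corpus
    transcript: str         # source-language transcript from the ASR corpus
    translation: str        # machine-translated draft awaiting expert review
    reviewed: bool = False  # set to True once an expert has verified the draft


def build_st_items(
    asr_manifest: Dict[str, str],
    translate: Callable[[str], str],
) -> List[SpeechTranslationItem]:
    """Pair each (audio, transcript) entry with a machine-translated draft."""
    items = []
    for audio_path, transcript in asr_manifest.items():
        text = transcript.strip()
        if not text:
            continue  # skip utterances that have no usable transcript
        items.append(SpeechTranslationItem(audio_path, text, translate(text)))
    return items


def toy_translate(text: str) -> str:
    # Placeholder standing in for a real machine translation system (assumption).
    return f"<mt>{text}</mt>"


# Tiny made-up manifest mapping audio paths to transcripts.
manifest = {
    "wav/utt_0001.wav": "你好",
    "wav/utt_0002.wav": "   ",
    "wav/utt_0003.wav": "谢谢",
}
items = build_st_items(manifest, toy_translate)
```

Keeping a `reviewed` flag on each record separates machine-translated drafts from entries that an expert has checked, which mirrors the review step the abstract mentions.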

Journal: China Scientific Data
Publication Date: Jun 30, 2022
License type: cc-by
