Abstract
Few public datasets exist for speech translation, especially between Chinese and low-resource languages, and this scarcity limits the development of end-to-end speech translation. Following the approach used to build international speech translation datasets, we converted two public speech recognition datasets (AISHELL and THUYG-20) into speech translation datasets through machine translation. After data processing, the results were reviewed and verified by experts to obtain high-quality speech translation datasets. The dataset comprises a Chinese-Mongolian speech translation subset and a Uygur-Chinese speech translation subset, with audio sampled at 16 kHz. The Chinese-Mongolian subset contains 1,919 samples (238 MB), and the Uygur-Chinese subset contains 3,692 samples (652 MB). The dataset can be used for research on end-to-end speech translation and provides data support for exploring speech translation between Chinese and minority languages. Because it has been reviewed and verified by experts, it can also be combined with speech recognition datasets to study machine translation.