Machine translation based data augmentation for Cantonese keyword spotting

Guangpu Huang, Jean-Luc Gauvain, Arseniy Gorin, Lori Lamel

doi:10.1109/icassp.2016.7472833

Publication Date: Mar 1, 2016

Citations: 32

Affiliation: Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, French National Centre for Scientific Research

#Language Model For Speech Recognition #Actual Term-weighted Value

Abstract

This paper presents a method for improving the language model of a low-resource language by using statistical machine translation from a related language to generate text data in the target language. The machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and then used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, for which resources are limited. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. With this method, the keyword search system detects 1.5 times more out-of-vocabulary words and achieves a 1.7% absolute improvement in actual term-weighted value.
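For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a term-weighted value is commonly computed under the NIST Spoken Term Detection 2006 definition; the actual term-weighted value is this quantity evaluated at the system's own decision threshold. It is not the scoring tool used in the paper. The names `KeywordStats` and `term_weighted_value` are hypothetical, the counts in the usage lines are invented for illustration, and the weight 999.9 is the NIST default, assumed here rather than taken from the abstract.

```python
from dataclasses import dataclass

# Assumed default from the NIST STD 2006 evaluation plan
# (cost/value ratio 0.1, keyword prior 1e-4); not stated in the abstract.
BETA = 999.9


@dataclass
class KeywordStats:
    """Per-keyword counts at a fixed decision threshold (hypothetical container)."""
    n_true: int         # occurrences of the keyword in the reference
    n_correct: int      # detections that match a reference occurrence
    n_false_alarm: int  # detections with no matching reference occurrence


def term_weighted_value(stats, speech_seconds, beta=BETA):
    """Return 1 minus the mean of (P_miss + beta * P_fa) over scored keywords.

    Keywords with no reference occurrence are skipped, because their miss
    probability is undefined.
    """
    scored = [s for s in stats.values() if s.n_true > 0]
    if not scored:
        raise ValueError("no keyword occurs in the reference")
    total = 0.0
    for s in scored:
        p_miss = 1.0 - s.n_correct / s.n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = s.n_false_alarm / (speech_seconds - s.n_true)
        total += p_miss + beta * p_fa
    return 1.0 - total / len(scored)


# Invented counts, for illustration only.
example = {
    "kw_a": KeywordStats(n_true=4, n_correct=4, n_false_alarm=0),
    "kw_b": KeywordStats(n_true=4, n_correct=2, n_false_alarm=0),
}
example_twv = term_weighted_value(example, speech_seconds=3600.0)
```

Because every keyword contributes equally to the average regardless of how often it occurs, detecting rare or previously out-of-vocabulary keywords can move the score noticeably, which is consistent with the two gains being reported together.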

Full Text