Abstract
In this paper, we propose a new method for code-switching (CS) automatic speech recognition (ASR) in Korean. First, we account for the phonetic variations that arise when Korean speakers pronounce English words, building a unified pronunciation model based on phonetic knowledge and deep learning. Second, to counteract the bias toward Korean caused by imbalanced training data, we extracted CS sentences semantically similar to the target domain and applied them to language model (LM) adaptation. In our experiments, the training data were AI Hub (1033 h) for Korean and Librispeech (960 h) for English. Compared to the baseline, the proposed method improved the error reduction rate (ERR) by up to 11.6% with phonetic variant modeling, and by 17.3% when semantically similar sentences were applied to LM adaptation. Considering only English words, the word correction rate improved by up to 24.2% over the baseline. These results suggest that the proposed method is highly effective for CS speech recognition.
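The ERR figures above are relative improvements over the baseline. As a minimal sketch (the specific word error rates below are illustrative, not taken from the paper), ERR can be computed from a baseline and a proposed word error rate (WER) as follows:

```python
def error_reduction_rate(baseline_wer: float, proposed_wer: float) -> float:
    """Relative error reduction rate (ERR) between two WERs, in percent.

    ERR = (baseline_wer - proposed_wer) / baseline_wer * 100
    """
    return (baseline_wer - proposed_wer) / baseline_wer * 100.0

# Illustrative values only: a baseline WER of 20.0% reduced to 17.68%
# corresponds to an ERR of 11.6%.
print(round(error_reduction_rate(20.0, 17.68), 1))  # → 11.6
```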
Highlights
Automatic speech recognition (ASR) and speech translation (ST) based on end-to-end (E2E) frameworks have shown significant improvements
In the case of Korean, English words pronounced by Korean speakers—Korean-style English (i.e., Konglish)—exhibit many phonetic variations relative to native-like English pronunciation
To simultaneously address the data imbalance and the scarcity of CS training data, in this paper we propose a hybrid method based on phonetic knowledge and deep learning that integrates Korean and English data
Summary
Automatic speech recognition (ASR) and speech translation (ST) based on end-to-end (E2E) frameworks have shown significant improvements. These systems have been widely adopted in real-life situations such as lectures, business meetings, and human–machine conversations. To assess the effect of CS, we investigated how often Korean sentences contain English words. CS can be categorized into two types: inter-sentential, where language transitions occur at phrase, sentence, or discourse boundaries; and intra-sentential, where transitions occur within a sentence.