Abstract

A Recurrent Neural Network (RNN)-based attention model has been used in code-switching speech recognition (CSSR). However, due to the sequential computation constraint of the RNN, it models short-range dependencies more strongly than long-range ones, which makes it hard to switch languages immediately in CSSR. To deal with this problem, we first introduce the CTC-Transformer, which relies entirely on a self-attention mechanism to draw global dependencies and adopts connectionist temporal classification (CTC) as an auxiliary task for better convergence. Second, we propose two multi-task learning recipes in which a language identification (LID) auxiliary task is learned in addition to the CTC-Transformer automatic speech recognition (ASR) task. Third, we study a decoding strategy that combines the LID into the ASR task. Experiments on the SEAME corpus demonstrate the effectiveness of the proposed methods, achieving a mixed error rate (MER) of 30.95%. This is up to a 19.35% relative MER reduction compared to the baseline RNN-based CTC-Attention system, and an 8.86% relative MER reduction compared to the baseline CTC-Transformer system.
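The abstract describes an interpolated multi-task objective: the attention-based ASR loss trained jointly with a CTC auxiliary loss and, in the multi-task recipes, an LID auxiliary loss. A minimal PyTorch sketch of such an objective is given below; the weights lambda_ctc and gamma_lid and the exact combination are illustrative assumptions, not the paper's reported formulation.

# Hypothetical sketch of a joint CTC / attention / LID multi-task loss,
# in the spirit of the CTC-Transformer described above (weights are assumptions).
import torch
import torch.nn.functional as F

def joint_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
               att_logits, att_targets,
               lid_logits, lid_targets,
               lambda_ctc=0.3, gamma_lid=0.1):
    # CTC auxiliary loss over frame-level log-probabilities of shape (T, N, C).
    loss_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens)
    # Attention-decoder cross-entropy over token logits of shape (N*L, V).
    loss_att = F.cross_entropy(att_logits, att_targets)
    # LID auxiliary cross-entropy over per-token language logits.
    loss_lid = F.cross_entropy(lid_logits, lid_targets)
    # Interpolated multi-task objective.
    return lambda_ctc * loss_ctc + (1.0 - lambda_ctc) * loss_att + gamma_lid * loss_lid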

Highlights

  • Code-switching (CS) speech is defined as speech that contains more than one language within an utterance [1]

  • Under the configuration of the label sequence (LLS) method, connectionist temporal classification (CTC) joint training for the language identification (LID) task, and LID joint decoding (see the decoding sketch after this list), the final system achieves a mixed error rate (MER) of 30.95%, obtaining up to a 19.35% and an 8.86% relative MER reduction compared to the Recurrent Neural Network (RNN)-based CTC-Attention baseline system (38.38%) and the CTC-Transformer baseline system (33.96%), respectively

  • We introduce a CTC-Transformer based E2E model for Mandarin–English code-switching speech recognition (CSSR), which outperforms most of the traditional systems on the SEAME corpus
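The LID joint decoding mentioned in the first highlight can be pictured as interpolating the ASR token score with an LID score during beam search. The sketch below is purely illustrative: the weight beta and the helper joint_score are hypothetical names, not the paper's method or API.

# Hypothetical sketch of LID joint decoding: during beam search, the ASR
# log-probability of a candidate token is interpolated with the LID
# log-probability of that token's language (beta is an assumed weight).
import math

def joint_score(asr_log_prob, lid_log_prob, beta=0.2):
    """Combine ASR and LID log-probabilities for one candidate token."""
    return asr_log_prob + beta * lid_log_prob

# Toy example: a token whose language the LID deems likely in context
# outranks a slightly stronger ASR candidate from the wrong language.
print(joint_score(math.log(0.30), math.log(0.70)))  # favoured
print(joint_score(math.log(0.35), math.log(0.10)))  # penalised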


Summary

Introduction

Code-switching (CS) speech is defined as speech that contains more than one language within an utterance [1]. Several methods have been proposed to improve language modeling for CS speech: recurrent neural network language models and factored language models that integrate part-of-speech tags, language information, or syntactic and semantic features [5,6,7]. On the acoustic side, speaker adaptation, phone sharing, and phone merging have been applied [4]. The Transformer [19,20] has achieved state-of-the-art performance in many monolingual ASR systems [21]. It transduces sequential data with its self-attention mechanism, which replaces the RNN of previous works [15,18]. All of our experiments are conducted on the SEAME corpus.
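Since the introduction contrasts self-attention with the RNN's sequential computation, a minimal NumPy sketch of standard scaled dot-product self-attention [19] may make the global-dependency point concrete. This is the textbook mechanism, not the paper's implementation; all variable names are illustrative.

# Minimal scaled dot-product self-attention over a (T, d) feature sequence;
# every frame attends to every other frame in one step, so the path between
# any two positions is O(1) rather than O(T) as in an RNN.
import numpy as np

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv          # project inputs to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                        # attention-weighted combination

# Toy usage: 5 frames of 8-dim features with 8-dim projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
out = self_attention(x, *(rng.standard_normal((8, 8)) for _ in range(3)))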
