Abstract

The end-to-end (E2E) framework has emerged as a viable alternative to conventional hybrid systems in the automatic speech recognition (ASR) domain. Unlike the monolingual case, the challenges faced by an E2E system on a code-switching ASR task include (i) the expansion of the target set to account for the multiple languages involved, (ii) the requirement of a robust target-to-word (T2W) transduction, and (iii) the need for more effective context modeling. In this paper, we aim to address these challenges for reliable training of an E2E ASR system on a limited amount of code-switching data. The main contributions of this work are the reduction of the E2E target set by exploiting acoustic similarity and the proposal of a novel context-dependent T2W transduction scheme. Additionally, a novel textual feature is proposed to enhance context modeling on code-switching data. The experiments are performed on a recently created Hindi-English code-switching corpus. For contrast, the existing combined-target-set based system is also evaluated. The proposed system outperforms the existing one and yields a target error rate of 18.1% along with a word error rate of 29.79%.
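As a rough illustration of the target set reduction idea, the sketch below merges acoustically similar Hindi and English units onto shared output targets so that the E2E output layer does not grow with the full combined inventory. The unit inventories, the similarity pairs, and the function build_reduced_target_set are illustrative assumptions made for this sketch and do not reproduce the paper's actual mapping or similarity criterion.

# Minimal sketch of target set reduction by acoustic similarity.
# The example inventories and similarity pairs below are assumptions
# for illustration, not the mapping used in the paper.

# Hypothetical (English unit, Hindi unit) pairs judged acoustically
# close enough to share a single output target.
SIMILAR_PAIRS = [
    ("k", "क"),   # unaspirated velar stop
    ("t", "ट"),   # alveolar/retroflex stop pair
    ("s", "स"),   # voiceless sibilant
]

def build_reduced_target_set(english_units, hindi_units, similar_pairs):
    """Map each language-specific unit to a shared target ID."""
    target_of = {}
    next_id = 0
    # Merge cross-lingual similar pairs into one shared target.
    for eng, hin in similar_pairs:
        target_of[eng] = target_of[hin] = next_id
        next_id += 1
    # Remaining units keep their own, language-specific targets.
    for unit in list(english_units) + list(hindi_units):
        if unit not in target_of:
            target_of[unit] = next_id
            next_id += 1
    return target_of

if __name__ == "__main__":
    eng = ["k", "t", "s", "z"]
    hin = ["क", "ट", "स", "झ"]
    mapping = build_reduced_target_set(eng, hin, SIMILAR_PAIRS)
    print("combined size:", len(eng) + len(hin),
          "reduced size:", len(set(mapping.values())))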

Highlights

  • Multilingual speakers often alternate between two or more languages during a conversation

  • The broad domains that carry out research on the code-switching phenomenon are (i) linguistics [4], [5], (ii) language identification and diarization [6], [7], (iii) automatic speech recognition (ASR) [8]–[10], and (iv) language modeling [11], [12]

  • Works on code-switching ASR [9], [13], [14] typically employ the hybrid framework originally developed for the monolingual ASR task

Introduction

Multilingual speakers often alternate between two or more languages (or dialects) during a conversation. In the literature, this phenomenon is referred to as code-switching [1], [2]. The scope of this work is limited to building an ASR system for code-switching data. Works on code-switching ASR [9], [13], [14] typically employ the hybrid framework originally developed for the monolingual ASR task. The hybrid framework comprises three sub-modules, namely, a pronunciation model (PM), an acoustic model (AM), and a language model (LM).
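For reference, the hybrid pipeline mentioned above implements the standard maximum a posteriori decoding rule; the decomposition below is the textbook formulation rather than anything specific to this paper:

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \, p(O \mid W)\, P(W) \approx \arg\max_{W} \max_{Q \in \mathcal{Q}(W)} p(O \mid Q)\, P(Q \mid W)\, P(W)

Here O is the acoustic observation sequence, W a candidate word sequence, and Q a phone sequence drawn from the pronunciations \mathcal{Q}(W); the AM supplies p(O \mid Q), the PM supplies P(Q \mid W), and the LM supplies P(W). An E2E system replaces this factorization with a single network that maps O directly to a target sequence, which is why the design of the target set and the T2W transduction become central in the code-switching setting.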
