Abstract

In this paper, we propose to incorporate local attention into WaveNet-CTC to improve the performance of Tibetan speech recognition under multitask learning. As the number of tasks increases, such as simultaneous Tibetan speech content recognition, dialect identification, and speaker recognition, the speech recognition accuracy of a single WaveNet-CTC decreases. Inspired by the attention mechanism, we introduce local attention to automatically tune the weights of the feature frames within a window, paying different levels of attention to context information across tasks. The experimental results show that our method improves speech recognition accuracy for all Tibetan dialects in three-task learning, compared with the baseline model. Furthermore, our method significantly improves the accuracy for the low-resource dialect by 5.11% against the dialect-specific model.

Highlights

  • Multitask learning has been applied successfully for speech recognition to improve the generalization performance of the model on the original task by sharing the information between related tasks [1,2,3,4,5,6,7,8,9]

  • In our previous work [10], Tibetan multidialect multitask speech recognition was conducted based on the WaveNet-Connectionist temporal classification (CTC), which performed simultaneous Tibetan multidialect speech content recognition, dialect identification, and speaker recognition in a single model

  • Although two-task WaveNet-CTC improved speech recognition performance for Lhasa of the U-Tsang dialect and the Changdu-Kham dialect, the three-task model did not improve performance for all dialects



Introduction

Multitask learning has been applied successfully to speech recognition, improving the generalization performance of a model on its original task by sharing information between related tasks [1,2,3,4,5,6,7,8,9]. In our previous work [10], WaveNet-CTC was trained on three Tibetan dialect data sets and learned shared representations and model parameters for speech recognition, speaker identification, and dialect recognition. To obtain better performance, an attention mechanism is introduced into WaveNet-CTC for multitask learning in this paper. Considering the computational complexity, we compute local attention using a sliding window over the whole sequence of speech feature frames to create weighted context vectors for the different recognition tasks. We propose WaveNet-CTC with local attention to perform multitask learning for Tibetan speech recognition, which can automatically capture the context information relevant to the different tasks.
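The sliding-window local attention described above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' exact formulation: the query vector `q`, the window half-width, and the scaled dot-product scoring are assumptions chosen for clarity; in the actual model the attention parameters would be learned jointly with WaveNet-CTC.

```python
import numpy as np

def local_attention(frames, query, window=5):
    """For each frame, attend over a local window of neighbouring frames
    and return a weighted context vector.

    frames: (T, D) array of speech feature frames
    query:  (D,) query vector (random here; learned in a real model)
    window: half-width of the local attention window
    """
    T, D = frames.shape
    contexts = np.zeros_like(frames)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        local = frames[lo:hi]                 # frames inside the window
        scores = local @ query / np.sqrt(D)   # scaled dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax attention weights
        contexts[t] = weights @ local         # weighted context vector
    return contexts

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 39))  # e.g. 100 frames of 39-dim MFCCs
q = rng.standard_normal(39)
ctx = local_attention(feats, q, window=5)
print(ctx.shape)  # (100, 39)
```

Restricting attention to a window keeps the cost linear in the number of frames (O(T x W x D) rather than O(T^2 x D) for full self-attention), which is the computational-complexity consideration mentioned above.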

