Abstract

Deep learning methods for language recognition have achieved promising performance. However, most studies focus on frameworks built around a single type of acoustic feature and a single task. In this paper, we propose deep joint learning strategies based on Multi-Feature (MF) and Multi-Task (MT) models. First, we investigate the effectiveness of integrating multiple acoustic features and explore two kinds of training constraints: one introduces auxiliary classification constraints with adaptively weighted loss functions in the feature-encoder sub-networks, and the other introduces a Canonical Correlation Analysis (CCA) constraint that maximizes the correlation between different feature representations. Second, correlated speech tasks, such as phoneme recognition, are applied as auxiliary tasks to learn related information that enhances language recognition. We analyze phoneme-aware information under different learning strategies: joint learning at the frame level, adversarial learning at the segment level, and their combination. In addition, we present a Language-Phoneme embedding extraction structure that learns and extracts language and phoneme embedding representations simultaneously. We demonstrate the effectiveness of the proposed approaches in experiments on the Oriental Language Recognition (OLR) data sets. Experimental results indicate that joint learning with the multi-feature and multi-task models extracts distinct feature representations for language identities and improves performance, especially under complex conditions such as cross-channel or open-set evaluation.
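The abstract's two multi-feature training constraints can be made concrete with short sketches. For the CCA constraint, one standard realization is the total-correlation objective of Deep CCA (Andrew et al., 2013), which maximizes the sum of canonical correlations between two views. The abstract does not give the paper's exact formulation, so the following is a minimal PyTorch sketch under that assumption; the branch outputs `h_mfcc` and `h_fbank` are hypothetical names for two feature-encoder representations.

```python
import torch

def cca_corr_loss(h1: torch.Tensor, h2: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Negative total canonical correlation between two views (batch x dim),
    in the style of Deep CCA. Minimizing it maximizes the correlation
    between the two feature representations."""
    m = h1.size(0)
    h1 = h1 - h1.mean(dim=0)                      # center each view
    h2 = h2 - h2.mean(dim=0)
    s12 = h1.t() @ h2 / (m - 1)                   # cross-covariance
    s11 = h1.t() @ h1 / (m - 1) + eps * torch.eye(h1.size(1), device=h1.device)
    s22 = h2.t() @ h2 / (m - 1) + eps * torch.eye(h2.size(1), device=h2.device)

    def inv_sqrt(s: torch.Tensor) -> torch.Tensor:
        # Inverse matrix square root via symmetric eigendecomposition.
        w, v = torch.linalg.eigh(s)
        return v @ torch.diag(w.clamp(min=eps).rsqrt()) @ v.t()

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # Total correlation = trace norm (sum of singular values) of T.
    return -torch.linalg.svdvals(t).sum()

# Hypothetical usage: segment-level outputs of two feature-encoder branches.
h_mfcc, h_fbank = torch.randn(32, 128), torch.randn(32, 128)
loss = cca_corr_loss(h_mfcc, h_fbank)             # add to the main loss, weighted
```

Likewise, segment-level adversarial learning against a phoneme classifier is commonly implemented with a gradient reversal layer (Ganin and Lempitsky, 2015): the forward pass is the identity, while the backward pass negates the gradient so the encoder learns representations the auxiliary classifier cannot exploit. The sketch below shows that common pattern, not necessarily the paper's exact mechanism.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, negated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, lam: float) -> torch.Tensor:
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return -ctx.lam * grad_output, None  # no gradient for lam

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Insert between the segment embedding and the adversarial phoneme classifier."""
    return GradReverse.apply(x, lam)
```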
