Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification

Xiuyan Liu, Chen Chen, Yongjun He

doi:10.1016/j.apacoust.2022.108854

Abstract

In this paper, a temporal feature extraction method based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) model and temporal pooling (TMPOOL) is proposed for language identification. First, the CNN-BLSTM model is employed as a front-end local feature extractor that learns temporal representations from acoustic features in both the forward and backward directions. Then the temporal pooling unit, a non-linear support vector regression (SVR) machine, efficiently learns the ordering relationship between the hidden states of the BLSTM and their time indices. Finally, this ordering relationship is used as an utterance-level representation. We conducted experiments on three tasks of the Oriental Language Recognition (OLR-2019) challenge. Compared with other CNN (BLSTM) methods, the proposed method achieves comparable error reductions.
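
The temporal pooling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies a non-linear SVR, whereas this sketch swaps in a linear regressor with an epsilon-insensitive loss, trained by plain subgradient descent, so that it stays dependency-free. The function name `temporal_pool`, its hyperparameters, and the toy "hidden states" are all assumptions made for illustration; in the paper the inputs would be BLSTM hidden states.

```python
from typing import List, Sequence


def temporal_pool(
    states: Sequence[Sequence[float]],
    epsilon: float = 0.05,
    reg: float = 1e-3,
    lr: float = 0.01,
    epochs: int = 500,
) -> List[float]:
    """Fit w (and a bias b) so that w . h_t + b tracks the normalized time index of h_t.

    Hypothetical helper: a linear stand-in for the non-linear SVR named in the abstract.
    The returned w has one entry per state dimension, regardless of sequence length.
    """
    num_frames = len(states)
    dim = len(states[0])
    w = [0.0] * dim
    b = 0.0
    # Regression targets: time indices scaled to (0, 1].
    targets = [(t + 1) / num_frames for t in range(num_frames)]

    for _ in range(epochs):
        for h, y in zip(states, targets):
            err = sum(wi * hi for wi, hi in zip(w, h)) + b - y
            # Epsilon-insensitive loss: errors inside the tube give no loss gradient.
            if err > epsilon:
                sign = 1.0
            elif err < -epsilon:
                sign = -1.0
            else:
                sign = 0.0
            # Subgradient step on the loss plus an L2 penalty on w.
            w = [wi - lr * (sign * hi + reg * wi) for wi, hi in zip(w, h)]
            b -= lr * sign
    return w


# Toy "hidden states": the first dimension drifts with time, the second is constant.
toy_states = [[t / 10.0, 0.5] for t in range(10)]
w = temporal_pool(toy_states)
scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in toy_states]
print(w)       # fixed-length vector usable as an utterance-level representation
print(scores)  # expected to increase with the frame index
```

The learned weight vector `w` encodes how the states evolve over time, which is why it can serve as a fixed-length summary of a variable-length utterance. A kernelized SVR, as described in the abstract, would replace the linear scoring function used here.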

Full Text