Language identification (LID) is a key component of many downstream speech-processing tasks. Recently, the self-supervised speech representations learned by Wav2Vec 2.0 (W2V2) have proven highly effective across a range of speech-related tasks. In LID, W2V2 is commonly used as a front end to extract frame-level features. However, there is currently no effective method for extracting temporal information from these frame-level features to improve LID performance. To address this issue, we propose an LID framework based on deep temporal representation (DTR) learning. First, the W2V2 model serves as a front-end feature extractor, capturing contextual representations from continuous raw audio in which temporal dependencies are embedded. Second, a temporal network is applied to the W2V2 output to learn these temporal dependencies. It comprises a temporal representation extractor that produces utterance-level representations and a temporal regularization term that constrains the temporal dynamics. Finally, the learned utterance-level temporal representations are used for classification. The proposed DTR method is evaluated on the OLR2020 database and compared with other state-of-the-art methods. The results show that it achieves competitive performance on all three tasks of the OLR2020 database.
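To make the described pipeline concrete, below is a minimal PyTorch sketch of a W2V2-plus-temporal-network LID model. Only the overall structure follows the abstract; the specific choices are assumptions, since the abstract does not detail the architecture: the `facebook/wav2vec2-base` checkpoint, the BiLSTM used as the temporal representation extractor, and the first-difference smoothness penalty standing in for the temporal regularization term are all illustrative, not the authors' exact design.

```python
# Sketch of a DTR-style LID model, assuming a PyTorch/HuggingFace stack.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class DTRClassifier(nn.Module):
    def __init__(self, num_languages: int, hidden: int = 256):
        super().__init__()
        # Front end: W2V2 producing frame-level contextual features
        # (frozen here for simplicity; fine-tuning is also plausible).
        self.w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.w2v2.requires_grad_(False)
        # Hypothetical temporal representation extractor: a BiLSTM whose
        # final hidden states form the utterance-level representation.
        self.temporal = nn.LSTM(
            input_size=self.w2v2.config.hidden_size,
            hidden_size=hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_languages)

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) of 16 kHz raw audio.
        frames = self.w2v2(waveform).last_hidden_state   # (B, T, D)
        seq, (h_n, _) = self.temporal(frames)            # seq: (B, T, 2H)
        utt = torch.cat([h_n[0], h_n[1]], dim=-1)        # utterance-level rep
        # Hypothetical temporal regularizer: penalize abrupt frame-to-frame
        # changes in the learned sequence (a smoothness constraint on
        # temporal dynamics).
        reg = (seq[:, 1:] - seq[:, :-1]).pow(2).mean()
        return self.classifier(utt), reg
```

Under this reading, training would minimize the classification loss plus a weighted regularization term, e.g. `loss = F.cross_entropy(logits, labels) + lam * reg`, where `lam` is a tunable weight.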