Abstract

Self-supervised learning (SSL) for speech pre-training has been remarkably successful at learning strong contextual speech representations from unlabeled audio, and the resulting models excel in numerous downstream speech tasks. However, pre-training these models demands substantial computational resources and long training times, which raises a high barrier to entry for pre-training research. In this work, we combine the resource efficiency of a generative learning model, the Masked Auto Encoder, with the effectiveness of vector quantization in discriminative learning, and introduce a novel pre-training framework: Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Unlike most SSL frameworks, which must build contextual speech representations and perform mask reconstruction simultaneously within an encoder-only module, SVQ-MAE is pre-trained with a dedicated, decoupled decoder. The decoupled decoder alone handles the mask reconstruction task, which reduces the learning complexity of the pretext task and lets the encoder extract contextual speech representations more efficiently. As a result, using only 4 GPUs, SVQ-MAE achieves performance comparable to wav2vec 2.0, which requires 64 GPUs for training. On the Speech Processing Universal Performance Benchmark (SUPERB), SVQ-MAE surpasses wav2vec 2.0 on both the keyword spotting and emotion recognition tasks. Furthermore, in cross-lingual ASR for Mandarin, after fine-tuning on AISHELL-1, SVQ-MAE achieves a Character Error Rate (CER) of 4.09%, outperforming all supervised ASR models.
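
For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of reference characters. The sketch below illustrates that standard definition only; it is not the evaluation script used for the result above, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # character of ref missing from hyp
            insertion = curr[j - 1] + 1  # extra character in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Illustrative strings only, not drawn from AISHELL-1.
    refs = ["今天天气很好"]
    hyps = ["今天天汽很好"]
    print(f"CER = {character_error_rate(refs, hyps):.2%}")
```

Under this definition, a CER of 4.09% corresponds to roughly four character errors per hundred reference characters, aggregated over the whole test set rather than averaged per utterance.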
