SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Xuyi Zhuang, Yukun Qian, Mingjiang Wang

doi:10.1186/s13636-024-00375-1

Abstract

Self-supervised learning (SSL) for speech pre-training has achieved remarkable success in acquiring strong contextual speech representations from unlabeled audio, and the resulting models excel in numerous downstream speech tasks. However, pre-training these models demands significant computational resources and training time, which presents a high barrier to entry. In this work, we combine the resource efficiency of a generative learning model, the Masked Auto Encoder, with the efficacy of the vector quantization method used in discriminative learning, and introduce a novel pre-training framework: Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Unlike most SSL frameworks, which must build contextual speech representations and perform mask reconstruction simultaneously within an encoder-only module, SVQ-MAE is pre-trained with a dedicated decoupled decoder. The decoupled decoder takes on the mask reconstruction task alone, which reduces the learning complexity of the pretext task and improves the encoder's efficiency in extracting contextual speech representations. Owing to this design, SVQ-MAE, trained on only 4 GPUs, achieves performance comparable to wav2vec 2.0, which requires 64 GPUs for training. In the Speech Processing Universal Performance Benchmark (SUPERB), SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition. Furthermore, in cross-lingual ASR for Mandarin, after fine-tuning on AISHELL-1, SVQ-MAE achieves a Character Error Rate of 4.09%, outperforming all supervised ASR models.
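For readers unfamiliar with the metric behind the 4.09% figure, Character Error Rate is conventionally the character-level edit distance between the recognized text and the reference, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the sample sentences are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair with one substituted character.
    refs = ["今天天气很好"]
    hyps = ["今天天汽很好"]
    print(f"CER = {character_error_rate(refs, hyps):.2%}")
```

Pooling edits and reference lengths over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score; whether the paper pools this way is not stated in the abstract.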


Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Oct 19, 2024
License type: CC BY-NC-ND 4.0

Similar Papers

Keyword Spotting using Vowel Onset Point, Vector Quantization and Hidden Markov Modeling Based techniques
B V Sandeep Reddy ... S R Mahadeva Prasanna
01 Nov 2008

Keyword Spotting: An Audio Mining Technique in Speech Processing – A Survey
25 Aug 2015

Advantages and Pitfalls of Dataset Condensation: An Approach to Keyword Spotting with Time-Frequency Representations
Pedro Henrique Pereira ... Miguel Arjona Ramírez
Electronics | VOL. 13
28 May 2024

Pre-training on Large-Scale Heterogeneous Graph
Xunqiang Jiang ... Yuan Fang
14 Aug 2021
