TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

Andy T Liu, Shang-Wen Li, Hung-Yi Lee

doi:10.1109/taslp.2021.3095662

Abstract

We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn from a single auxiliary task such as contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns by reconstructing acoustic frames from their altered counterparts, where we use a stochastic policy to alter along three dimensions: time, frequency, and magnitude. TERA can be used for speech representation extraction or for fine-tuning with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We present a large-scale comparison of various self-supervised models. TERA achieves strong performance in the comparison by improving upon surface features and outperforming previous models. In our experiments, we study the effect of applying different alteration techniques, pre-training on more data, and pre-training on various features. We analyze different model sizes and find that smaller models are stronger representation learners than larger models, while larger models are more effective for downstream fine-tuning than smaller models. Furthermore, we show the proposed method is transferable to downstream datasets not used in pre-training.
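
The alteration-then-reconstruction objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`alter`, `l1_loss`), the span width, the probabilities, the noise scale, the choice of zero-masking, and the use of an L1 reconstruction error are all assumptions made for illustration, since the abstract does not specify them. The sketch uses only the Python standard library and represents an utterance as a list of frames, each a list of feature values.

```python
import random


def alter(frames, rng, time_width=7, time_prob=0.15,
          freq_max_width=8, noise_prob=0.15, noise_std=0.2):
    """Return an altered copy of `frames` (T frames, each with F features).

    All default values are illustrative assumptions, not taken from the source.
    """
    T, F = len(frames), len(frames[0])
    altered = [row[:] for row in frames]  # copy so the target stays intact

    # Time alteration: zero out contiguous spans of frames.
    n_spans = max(1, int(T * time_prob / time_width))
    for _ in range(n_spans):
        start = rng.randrange(0, max(1, T - time_width + 1))
        for t in range(start, min(T, start + time_width)):
            altered[t] = [0.0] * F

    # Frequency alteration: zero out one contiguous band of feature channels.
    width = min(F, rng.randint(0, freq_max_width))
    f0 = rng.randrange(0, F - width + 1)
    for row in altered:
        for f in range(f0, f0 + width):
            row[f] = 0.0

    # Magnitude alteration: occasionally add Gaussian noise to every value.
    if rng.random() < noise_prob:
        for row in altered:
            for f in range(F):
                row[f] += rng.gauss(0.0, noise_std)

    return altered


def l1_loss(pred, target):
    """Mean absolute error between two T x F feature matrices."""
    total = sum(abs(p - q)
                for pred_row, target_row in zip(pred, target)
                for p, q in zip(pred_row, target_row))
    return total / (len(target) * len(target[0]))


# Toy usage: 40 frames of 16 constant features. A real setup would pass
# `altered` through a Transformer Encoder and score its output against `frames`.
rng = random.Random(0)
frames = [[1.0] * 16 for _ in range(40)]
altered = alter(frames, rng)
loss_without_model = l1_loss(altered, frames)
```

In this sketch the loss is computed directly between the altered input and the clean target, which only measures how much was altered. During pre-training, a model's prediction would take the place of `altered` in the call to `l1_loss`, so that the model is rewarded for recovering the original frames.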

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2021
Citations: 179

Similar Papers

A Target-Separable BWN Inspired Speech Recognition Processor with Low-power Precision-adaptive Approximate Computing
Bo Liu ... Hao Cai
14 Mar 2022

Genetic Algorithm for Combined Speaker and Speech Recognition using Deep Neural Networks
Gurpreet Kaur ... Mohit Srivastava
Journal of Telecommunications and Information Technology | VOL. 2
29 Jun 2018

IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian Languages
Tahir Javed ... Mitesh M Khapra
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 37
26 Jun 2023

Pretext Tasks Selection for Multitask Self-Supervised Audio Representation Learning
Salah Zaiem ... Titouan Parcollet
IEEE Journal of Selected Topics in Signal Processing | VOL. 16
01 Oct 2022
