Sampleformer: An efficient conformer-based Neural Network for Automatic Speech Recognition

Zeping Fan, Zhaohui Bu, Min Huang, Xuejun Zhang

doi:10.3233/ida-230612

Abstract

The recently introduced Convolution-augmented Transformer (Conformer) model has attained state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of methodical investigations shows that the Conformer’s design decisions may not be the most efficient choices under a limited computational budget. After a thorough re-evaluation of these design choices, we propose Sampleformer, which reduces the complexity of the Conformer architecture and delivers more robust performance. We introduce downsampling to the Conformer encoder and, to better exploit the information in the speech features, incorporate an additional downsampling module that improves both the efficiency and the accuracy of the model. We also propose a novel and adaptable attention mechanism called multi-group attention, which reduces the attention complexity from $O(n^2 d)$ to $O(n^2 d \cdot f/g)$. In experiments on the AISHELL-1 corpus, our 13.3-million-parameter CTC model achieves a 3.0%/2.6% relative reduction in character error rate (CER) on the dev/test sets without using a language model (LM). In addition, the model shows a 30% improvement in inference speed over our CTC Conformer baseline and trains 27% faster.
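To give a feel for the complexity claim, the following is a minimal arithmetic sketch, not the authors' implementation. It evaluates the two leading-order cost expressions quoted in the abstract. The abstract does not define f and g, so they are treated here as opaque parameters; the function names and all numeric values are illustrative assumptions.

```python
def full_attention_cost(n: int, d: int) -> int:
    """Leading-order cost of standard self-attention, O(n^2 d)."""
    return n * n * d


def multi_group_attention_cost(n: int, d: int, f: float, g: float) -> float:
    """Leading-order cost quoted in the abstract, O(n^2 d * f / g).

    f and g are used exactly as they appear in the formula; their meaning
    is not given in the abstract, so no interpretation is assumed here.
    """
    return n * n * d * f / g


if __name__ == "__main__":
    n, d = 400, 256  # illustrative sequence length and model width (assumed)
    f, g = 1, 4      # illustrative values only, not taken from the paper

    base = full_attention_cost(n, d)
    grouped = multi_group_attention_cost(n, d, f, g)
    print(f"multi-group / full cost ratio: {grouped / base:.2f}")

    # Because the cost is quadratic in n, halving the sequence length
    # (for example by downsampling) cuts this term by a factor of four.
    halved = full_attention_cost(n // 2, d)
    print(f"cost ratio after halving n: {halved / base:.2f}")
```

Under these expressions the saving from multi-group attention depends only on the ratio f/g, while downsampling acts through the quadratic dependence on n, so the two reductions multiply.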

