Abstract

Recently, end-to-end (E2E) automatic speech recognition (ASR) has been widely adopted due to its advantages over hybrid methods. Although existing E2E ASR models achieve impressive performance, they usually have large model sizes and suffer from slow inference in real-world applications. To obtain faster models for E2E ASR, in this paper we propose SFA, which searches for faster architectures with the help of neural architecture search (NAS). SFA consists of a search space containing a set of candidate architectures and a search algorithm responsible for finding the optimal architecture within that space. On one hand, SFA designs a topology-fused search space to integrate the topologies of existing architectures (e.g., Transformer, Conformer) and to explore brand-new ones. On the other hand, combined with the training criterion of E2E ASR, SFA develops a speed-aware differentiable search algorithm that searches for faster architectures according to the target hardware device. Additionally, a connectionist temporal classification (CTC) based progressive search algorithm is proposed to reduce the difficulty of architecture search and obtain better performance. On two commonly used Mandarin datasets, SFA effectively improves the inference speed of existing E2E ASR models with comparable recognition performance, achieving up to 2.46×/1.98× CPU/GPU speedups over the best human-designed baselines.

