Abstract

With the rapid development of speech assistants, adapting automatic speech recognition (ASR) solutions designed for servers to run directly on the device has become crucial. For on-device speech recognition, researchers and industry prefer end-to-end ASR systems, as they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly means handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, represented by the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method tokenizes utterances non-deterministically, which extends the contexts in which tokens appear and regularizes their distribution so that the model recognizes unseen words better. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative improvement in word error rate (WER) and 25% relative improvement in F-score) at no additional computational cost. With BPE-dropout, our monolingual Turkish Conformer achieves a competitive result of 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
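As a rough illustration of the non-deterministic tokenization described above, the following Python sketch segments a single word with a ranked BPE merge table while randomly skipping candidate merges. The function name `bpe_dropout_tokenize`, the toy merge table, and the dropout values are assumptions made for illustration only; they are not the configuration or implementation used in this work.

```python
import random


def bpe_dropout_tokenize(word, merges, dropout=0.1, rng=None):
    """Segment `word` using ranked BPE merges, dropping each candidate
    merge with probability `dropout` at every step (illustrative sketch)."""
    rng = rng or random.Random()
    # Earlier merges in the table have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            return tokens
        # Apply the highest-priority surviving merge.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Hypothetical toy merge table, ordered by priority.
TOY_MERGES = [("e", "r"), ("l", "o"), ("lo", "w"), ("low", "er")]

if __name__ == "__main__":
    print(bpe_dropout_tokenize("lower", TOY_MERGES, dropout=0.0))
    for seed in range(3):
        rng = random.Random(seed)
        print(bpe_dropout_tokenize("lower", TOY_MERGES, dropout=0.3, rng=rng))
```

With `dropout=0.0` the sketch reduces to ordinary deterministic BPE, and with `dropout=1.0` it falls back to single characters. Intermediate values produce different segmentations of the same word across calls, which is the source of the augmentation effect: the model sees the same transcript split into varying acoustic units.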

Highlights

  • We have extensively studied how two acoustic unit augmentations, Byte Pair Encoding (BPE)-dropout and unigram language model (ULM) subword regularization, affect the performance of strong end-to-end automatic speech recognition (ASR) baselines in low-resource conditions

  • We have proposed a method of dynamic acoustic unit augmentation based on the BPE-dropout technique

  • This method improves ASR system quality at no additional computational cost in training or decoding

Introduction

Personalization is more complicated for voice search than for typing-based search, since it starts at the speech recognition stage, before the results are ranked. The part of voice search that is responsible for transducing speech to words and passing them to the search field can be thought of as the large vocabulary continuous speech recognition (LVCSR) task of automatic speech recognition (ASR). One of the main challenges in this task is the recognition of words that the ASR system has not encountered before; such words are called out-of-vocabulary (OOV). Recognition errors occur more often for such words than for words the system is aware of. An incorrectly recognized voice query may decrease the user-perceived quality of the whole system.
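To make the notion of OOV words concrete, the short sketch below computes a word-level OOV rate: the share of evaluation words that never occur in the training transcripts. The function name `oov_rate` and the example sentences are hypothetical and serve only to illustrate the definition.

```python
def oov_rate(train_words, eval_words):
    """Return the fraction of evaluation words absent from the training
    vocabulary, together with the list of those OOV words."""
    vocab = set(train_words)
    oov = [w for w in eval_words if w not in vocab]
    return len(oov) / len(eval_words), oov


# Hypothetical transcripts, split on whitespace.
train = "open the music app".split()
evaluation = "open the podcast app".split()

rate, missing = oov_rate(train, evaluation)
print(rate, missing)
```

A word-level system can never output the missing word, whereas a subword-level system can in principle assemble it from smaller units it has seen in training.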
