Abstract

This study investigated large-scale semi-supervised training (SST) to improve acoustic models for automatic speech recognition (ASR). Conventional self-training, the recently proposed committee-based SST using heterogeneous neural networks, and lattice-based SST were examined and compared. Large-scale SST was studied for deep neural network acoustic modeling with respect to automatic transcription quality, importance-based data filtering, training data quantity, and other attributes of a large quantity of multi-genre unsupervised live data. We found that the behavior of SST on large-scale ASR tasks differs greatly from that observed in small-scale SST: 1) big data can tolerate a certain degree of mislabeling in the automatic transcriptions, so further performance gains are possible with more unsupervised fresh data even when the automatic transcriptions contain a certain degree of errors; 2) the audio attributes, transcription quality and importance of the fresh data matter more for large-scale SST than the increase in data quantity; and 3) performance gains differ greatly across recognition tasks, so the benefits depend strongly on the selected attributes of the unsupervised data and on the data scale of the baseline ASR system. Furthermore, we proposed a novel utterance filtering approach based on active learning to improve data selection in large-scale SST. The experimental results showed that SST with the proposed data filtering yields a 2-11% relative word error rate reduction on five multi-genre recognition tasks, even with a baseline acoustic model that was already well trained on a 10,000-hour supervised dataset.
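
The utterance filtering is described here only at a high level, so the following is a minimal Python sketch of the general idea behind confidence-based data selection for SST. The Utterance record, the filter_for_sst helper and the 0.9 threshold are illustrative assumptions, not the paper's implementation; the active-learning filter proposed in the study is more elaborate than a single confidence cutoff.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Utterance:
        audio_id: str        # identifier of the fresh/live recording
        hypothesis: str      # 1-best automatic transcription from the current AM
        confidence: float    # utterance-level confidence score in [0, 1]

    def filter_for_sst(decoded: List[Utterance],
                       conf_threshold: float = 0.9) -> List[Tuple[str, str]]:
        """Keep only utterances whose automatic transcription looks reliable
        enough to serve as a training label for the next SST round."""
        return [(u.audio_id, u.hypothesis)
                for u in decoded
                if u.confidence >= conf_threshold]

    # Example: two decoded live utterances; only the high-confidence one is
    # retained and mixed into the next round of acoustic-model training.
    decoded = [Utterance("utt001", "play some music", 0.97),
               Utterance("utt002", "uh unclear mumbling", 0.41)]
    selected = filter_for_sst(decoded)   # [("utt001", "play some music")]

The point of such a filter is the trade-off highlighted in the findings above: a lower threshold admits more fresh data but also more mislabeled transcriptions, and at large data scales the quality and relevance of what is kept matter more than the raw quantity.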

Highlights

  • The performance of automatic speech recognition (ASR) has been significantly improved in recent years with the rapid development of deep learning algorithms [1]–[4]

  • Unlike previous works on a single type of live data [11], [12], we focused on large-scale unsupervised multi-genre speech data that were collected automatically from an online ASR engine

  • We found that 1) further performance gains can be obtained with more unsupervised fresh data, even when the automatic transcriptions contain a certain degree of errors; 2) the audio attributes, transcription quality and importance of the fresh data matter more for large-scale semi-supervised training (SST) than the increase in data quantity; 3) performance gains differ greatly across recognition tasks, so the benefits depend strongly on the selected attributes of the unsupervised data and on the data scale of the baseline ASR system; and 4) SST with the proposed active learning (AL) data filtering yields a 2-11% relative word error rate (WER) reduction on five multi-genre recognition tasks, even with a baseline acoustic model (AM) that was already trained on a 10,000-hour supervised dataset

Introduction

The performance of automatic speech recognition (ASR) has been significantly improved in recent years with the rapid development of deep learning algorithms [1]–[4], and ASR techniques have been successfully applied in a growing number of industrial services. To achieve the best performance for each industrial service, constantly updating the acoustic model (AM) with fresh or live speech from the latest production traffic is very important [5], [6], especially for mobile voice search, short message dictation, smart medical voice dictation, etc. We consider the information contained in a large quantity of live or fresh data helpful for training a better AM, because the properties of users' speech may deviate substantially from those of the data used to train the existing AMs, with respect to acoustic environments, text content, etc.
