Abstract

Manual transcription of audio databases for the development of automatic speech recognition (ASR) systems is a costly and time-consuming process. In the context of deriving acoustic models adapted to a specific application, or in low-resource scenarios, it is therefore essential to explore alternatives capable of improving speech recognition results. In this paper, we investigate the relevance of foreign data characteristics, in particular domain and language, when using this data as an auxiliary data source for training ASR acoustic models based on deep neural networks (DNNs). The acoustic models are evaluated on a challenging bilingual database within the scope of the MediaParl project. Experimental results suggest that in-language (but out-of-domain) data is more beneficial than in-domain (but out-of-language) data when employed in either supervised or semi-supervised training of DNNs. The best-performing ASR system, an HMM/GMM acoustic model that exploits a DNN as a discriminatively trained feature extractor, outperforms the best-performing HMM/DNN hybrid by about 5% relative in terms of word error rate (WER); the accumulated relative gain over the MFCC-based HMM/GMM baseline is about 30% WER.
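To make the relative WER figures above concrete: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the reference length, and a "30% relative" gain scales the baseline error rate, not the absolute percentage points. A minimal sketch (illustrative only, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference),
    via dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# A 30% relative improvement over a hypothetical 20% WER baseline:
baseline, relative_gain = 0.20, 0.30
print(round(baseline * (1 - relative_gain), 2))  # 0.14, i.e. 14% WER
```

The baseline value of 20% here is purely illustrative; the paper reports relative gains, not this specific operating point.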

Highlights

  • Current automatic speech recognition (ASR) systems are based on statistical parametric methodologies and require large amounts of transcribed speech data during training

  • Results in terms of word error rates (WERs) on the evaluation set of the MediaParl French data (MP-FR) indicate significantly better performance of the deep neural network (DNN)-based systems compared to the hidden Markov model (HMM)/GMM systems

  • More data helps: in line with hypothesis one, we explore whether commonly used French corpora of transcribed speech can improve the performance of the MP-FR ASR system


Summary

Introduction

Current automatic speech recognition (ASR) systems are based on statistical parametric methodologies and require large amounts of transcribed speech data during training. There is a long-standing belief in the speech recognition community that "there is no data like more data". In this spirit, a number of efforts have been undertaken to transcribe large amounts of speech data (e.g., the GALE project [1]) in order to improve performance. Transcribing speech is usually an expensive manual process, so several efforts towards the use of untranscribed data during training have been made in the past. However, performance gains quickly saturate as more data is continuously added.

