Combination of multiple acoustic models with unsupervised adaptation for lecture speech transcription

Peng Shen, Xugang Lu, Xinhui Hu, Naoyuki Kanda, Masahiro Saiko, Chiori Hori, Hisashi Kawai

doi:10.1016/j.specom.2016.05.001

Abstract

Automatic speech recognition (ASR) systems have made considerable progress in real applications through careful architectural design combined with advanced techniques and algorithms. However, designing a system that efficiently integrates these techniques to achieve high performance remains challenging. In this paper, we introduce an ASR system based on ensemble model combination and adaptation, with two characteristics: (1) large-scale combination of multiple ASR systems using Recognizer Output Voting Error Reduction (ROVER), and (2) multi-pass unsupervised speaker adaptation of deep neural network acoustic models together with topic adaptation of the language model. The multiple acoustic models were trained with different acoustic features and model architectures, which provided complementary and discriminative information to the ROVER process. With these multiple acoustic models, the ROVER process yielded better estimates of word confidence, which helped in selecting data for unsupervised adaptation of the previously trained acoustic models. The final recognition result was obtained through multi-pass decoding, ROVER, and adaptation. We tested the system on Technology, Entertainment and Design (TED) lecture speech used in the International Workshop on Spoken Language Translation (IWSLT) evaluation campaign and obtained word error rates of 6.5%, 7.0%, 10.6%, and 8.4% on the 2011, 2012, 2013, and 2014 test sets, respectively, which to our knowledge are the best results for these evaluation sets.
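
The voting and data-selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the hypotheses from the individual systems have already been aligned slot by slot (full ROVER builds this alignment itself and can also weight votes by recognizer confidence scores), and it uses the fraction of agreeing systems as a stand-in for word confidence. The names `vote`, `select_for_adaptation`, `NULL`, and the threshold value are hypothetical.

```python
from collections import Counter

# Placeholder for "no word in this slot" (an assumption of this sketch).
NULL = "@"


def vote(aligned_hyps):
    """Majority vote per slot over pre-aligned hypotheses.

    aligned_hyps: list of equal-length word lists, one per ASR system.
    Returns a list of (word, confidence) pairs, where confidence is the
    fraction of systems that agreed with the winning word.
    """
    n_systems = len(aligned_hyps)
    result = []
    for slot in zip(*aligned_hyps):
        word, count = Counter(slot).most_common(1)[0]
        result.append((word, count / n_systems))
    return result


def select_for_adaptation(voted, threshold=0.75):
    """Keep only real words whose agreement-based confidence meets the threshold."""
    return [word for word, conf in voted if word != NULL and conf >= threshold]


# Three hypothetical system outputs, already aligned.
hyps = [
    ["the", "cat", "sat", NULL],
    ["the", "cat", "sat", "down"],
    ["a", "hat", "sat", NULL],
]
voted = vote(hyps)
reliable = select_for_adaptation(voted, threshold=0.9)
```

In this toy setting, only words on which the systems largely agree would be passed on as labels for unsupervised adaptation, which mirrors the role the abstract assigns to ROVER-derived word confidence.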

Journal: Speech Communication
Publication Date: May 24, 2016
Citations: 7

Similar Papers

A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)
J.G Fiscus
14 Dec 1997

Gaussian mixture models for adaptation of deep neural network acoustic models in automatic speech recognition systems
N.A Tomashenko ... Yu.N Matveev
Scientific and Technical Journal of Information Technologies, Mechanics and Optics | VOL. 106
15 Nov 2016

Unsupervised adaptation of student DNNs learned from teacher RNNs for improved ASR performance
Lahiru Samarakoon ... Brian Mak
01 Dec 2017

Exploring recurrent neural network based acoustic and linguistic modeling for children's speech recognition
Sreeram Ganji ... Rohit Sinha
01 Nov 2017
