On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling

Annika Hämäläinen, Louis Ten Bosch, Lou Boves, Johan De Veth

doi:10.1155/2007/46460

Open Access

https://doi.org/10.1155/2007/46460

Abstract

Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.

Highlights

Conventional large-vocabulary continuous speech recognisers use context-dependent phone models, such as triphones, to model speech.
To estimate the proportion of syllable tokens that were potentially sensitive to large deviations from their canonical representation, we examined the structure of the syllables in the TIMIT database.
This paper contrasted recognition results obtained using longer-length acoustic models for Dutch read speech from a library for the blind with recognition results achieved on American English read speech from TIMIT.

Summary

Introduction

Conventional large-vocabulary continuous speech recognisers use context-dependent phone models, such as triphones, to model speech. Apart from their capability of modelling (some) contextual effects, the main advantage of triphones is that the fixed number of phonemes in a given language guarantees their robust training when reasonable amounts of training data are available and when state tying methods are used to deal with infrequent triphones. Phone-based modelling, however, rests on the assumption that speech can be represented as a sequence of discrete phonemes (beads on a string) that can only be substituted, inserted, or deleted to account for pronunciation variation [1]. Given this assumption, it should be possible to account for pronunciation variation at the level of the phonetic transcriptions in the recognition lexicon. We must conclude that a representation of speech in terms of a sequence of discrete symbols is not fully adequate.
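The beads-on-a-string view described above can be illustrated with a small sketch: if pronunciation variation consists only of substitutions, insertions, and deletions of phones, then a canonical transcription and a realised one can be compared with a minimum-edit alignment. The Python below is only an illustration under that assumption, not the procedure used in the paper; the function name `align_counts` and the ARPAbet-like phone symbols are hypothetical choices made for this example.

```python
def align_counts(canonical, realised):
    """Count (substitutions, insertions, deletions) in a minimum-edit
    alignment of a canonical phone sequence with a realised one.

    Both arguments are lists of phone symbols (illustrative labels only).
    """
    n, m = len(canonical), len(realised)
    # cost[i][j] = (total_edits, subs, ins, dels) for
    # canonical[:i] aligned with realised[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, 0, i)      # delete every canonical phone so far
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)      # insert every realised phone so far

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, a, d = cost[i - 1][j - 1]
            if canonical[i - 1] == realised[j - 1]:
                diag = (t, s, a, d)            # match, no edit
            else:
                diag = (t + 1, s + 1, a, d)    # substitution
            t, s, a, d = cost[i - 1][j]
            dele = (t + 1, s, a, d + 1)        # canonical phone deleted
            t, s, a, d = cost[i][j - 1]
            inse = (t + 1, s, a + 1, d)        # extra phone inserted
            # Tuples compare on total edits first, so min() picks a
            # minimum-edit path (ties are broken deterministically).
            cost[i][j] = min(diag, dele, inse)

    return cost[n][m][1:]


# Hypothetical example: final /d/ of "and" is dropped in the realisation.
print(align_counts(["ae", "n", "d"], ["ae", "n"]))
```

An alignment of this kind captures only discrete, symbol-level changes. Gradual or partial reductions that fall between symbols are exactly what such a representation cannot express.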

Journal: EURASIP Journal on Audio, Speech, and Music Processing	Publication Date: Jan 1, 2007
Citations: 25	License type: cc-by
