Context-dependent Phone Research Articles

This paper describes the specification, design and development phases of two widely used telephone services based on automatic speech recognition. The effort spent evaluating and tuning these services is discussed in detail. In developing the first service, based mainly on the recognition of “alphanumeric” sequences, a significant part of the work consisted of refining the acoustic models. To increase recognition accuracy we adopted algorithms and methods previously consolidated on broadcast news transcription tasks. A significant result shows that task-specific context-dependent phone models reduce the word error rate by about 40% relative to context-independent phone models. Note that this result was achieved on a small-vocabulary task, significantly different from those generally used in broadcast news transcription. We also investigated both unsupervised and supervised training procedures. Moreover, we studied a novel partly supervised technique that allows us to select, in some “optimal” way, the speech material to transcribe manually and use for acoustic model training. The proposed procedure gives performance close to that obtained with a completely supervised training method. In the second service, based mainly on phrase spotting, considerable effort was devoted to language model refinement. In particular, several types of rejection networks were studied to detect out-of-vocabulary words for the given task; a major result demonstrates that using rejection networks based on a class trigram language model reduces the word error rate from 36.7% to 11.1% compared with a phone loop network. For the latter service, the benefits and related costs of regular grammars, stochastic language models and mixed language models are also reported and discussed. Finally, most of the experiments described in this paper were carried out on field databases collected through the developed services.
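
The error-rate figures quoted in this abstract can be put on a common footing with a short calculation. The sketch below is not taken from the paper: the function name `relative_reduction` is an assumption introduced here, and it only converts the two reported absolute word error rates (36.7% and 11.1%) into a relative reduction, the same kind of measure as the "about 40%" reported for the acoustic models.

```python
def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline word error rate that the improved system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


# Figures reported in the abstract for out-of-vocabulary rejection.
phone_loop_wer = 36.7      # % WER with a phone loop rejection network
class_trigram_wer = 11.1   # % WER with a class trigram rejection network

rejection_gain = relative_reduction(phone_loop_wer, class_trigram_wer)
print(f"relative WER reduction: {rejection_gain:.1%}")
```

On these figures the class trigram rejection network removes roughly 70% of the errors made with the phone loop network.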

A new phone recognizer has been implemented which extends the (phonotactic) decoding constraint to sequences of three phones. It is based on a structure similar to a second-order ergodic hidden Markov model (HMM). This kind of model assumes a direct correspondence between model states and phones, so constraints on possible state sequences are equivalent to phonotactic constraints. Very high coverage by both left and right context-dependent phone models has been achieved using two methods. The first assumes that some contexts have the same or a very similar effect on the phone in question, so they are merged into the same contextual class. The outcome is a set of 19 left context classes and 18 right context classes. The second assumes that the left context mostly influences the beginning of a phone, whereas the right context influences its end. Each phone (a state in an ergodic HMM) is represented by a sequence of three probability density functions (pdfs), which is similar to a three-state left-to-right HMM. We generate acoustic models such that the first pdf in the model is conditioned on the left context, the middle pdf is context independent (or it can also be context dependent), and the last pdf is conditioned on the right context. A large number of such quasi-triphonic acoustic models can be generated, providing good triphone coverage for a given task while efficiently utilizing the available training data. The current implementations of the recognizer have been applied to the DARPA Resource Management (RM) task, to demonstrate the feasibility of performing phone (not phoneme) recognition using an untranscribed database, and to the TIMIT database, for comparison with existing phone recognition systems. Since true phone sequences for the training utterances are not available for the RM database, they are estimated from text using a phone realization classification tree trained on the TIMIT database transcriptions. The estimates of the true phone sequences are used in training the models and generating reference phone sequences for scoring. The best phone recognition match between the most likely path through the classification tree and the phone recognizer output for the DARPA February 89 test set was 80·5% accurate and 84·0% correct. The best result obtained using the same recognizer structure on the TIMIT database is 69·4% accurate and 74·8% correct, which is a significant improvement over the best published result when both are reduced to the same phone set.
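
The quasi-triphonic construction described in this abstract can be illustrated with a small sketch. It is only one possible reading, not the authors' implementation: the class names, the phone labels and the helper `quasi_triphone` are hypothetical, and the middle pdf is assumed to be context independent. The class counts (19 left, 18 right) are the only values taken from the abstract.

```python
from typing import NamedTuple

# Hypothetical context classes for illustration; the actual 19 left and
# 18 right classes are not listed in the abstract.
LEFT_CLASS = {"p": "labial_stop", "b": "labial_stop", "s": "fricative"}
RIGHT_CLASS = {"t": "alveolar_stop", "d": "alveolar_stop", "m": "nasal"}


class QuasiTriphone(NamedTuple):
    first_pdf: tuple   # conditioned on the left-context class
    middle_pdf: tuple  # context independent (assumption)
    last_pdf: tuple    # conditioned on the right-context class


def quasi_triphone(left: str, phone: str, right: str) -> QuasiTriphone:
    """Compose a three-pdf model for `phone` between `left` and `right`."""
    return QuasiTriphone(
        first_pdf=(phone, "first", LEFT_CLASS[left]),
        middle_pdf=(phone, "middle"),
        last_pdf=(phone, "last", RIGHT_CLASS[right]),
    )


a = quasi_triphone("p", "ae", "t")
b = quasi_triphone("b", "ae", "d")
c = quasi_triphone("s", "ae", "m")

# p/b and t/d fall in the same classes, so a and b share all three pdfs.
print(a == b)
# c shares only the context-independent middle pdf with a.
print(a.middle_pdf == c.middle_pdf, a.first_pdf == c.first_pdf)

# Pdfs needed per phone with the class counts from the abstract.
n_left, n_right = 19, 18
pdfs_per_phone = n_left + 1 + n_right   # first pdfs + one middle + last pdfs
contexts_covered = n_left * n_right     # distinct left/right class pairs
print(pdfs_per_phone, contexts_covered)
```

Under these assumptions, 38 pdfs per phone are enough to build a model for each of the 342 left-class/right-class combinations, which is one way to read the claim of good triphone coverage from limited training data.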

Articles published on Context-dependent Phone

Standard Yorùbá context dependent tone identification using Multi-Class Support Vector Machine (MSVM)

Investigation of Various Hybrid Acoustic Modeling Units via a Multitask Learning and Deep Neural Network Technique for LVCSR of the Low-Resource Language, Amharic

Heterophonic speech recognition using composite phones

A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition

Chinese-English Phone Set Construction for Code-Switching ASR Using Acoustic and DNN-Extracted Articulatory Features

Integrated exemplar-based template matching and statistical modeling for continuous speech recognition

Matching Criteria for Vocabulary-Independent Search

Syllable modeling in continuous speech recognition for Tamil language

Modelling pronunciation variation with single-path and multi-path syllable models: Issues to consider

On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling

Design and evaluation of acoustic and language models for large scale telephone services

Specifics of Hidden Markov Model Modifications for Large Vocabulary Continuous Speech Recognition

Syllable-based large vocabulary continuous speech recognition

MDL-based context-dependent subword modeling for speech recognition

An efficient search space representation for large vocabulary continuous speech recognition

Speech recognition and synthesis technology development at NTT for telecommunications services

Speaker-independent continuous speech dictation

High accuracy phone recognition using context clustering and quasi-triphonic models

Improved acoustic modeling for large vocabulary continuous speech recognition

Word juncture modeling using phonological rules for HMM-based continuous speech recognition
