Telephone Speech Research Articles

It is often acknowledged that speech signals contain short-term and long-term temporal properties [Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition, Prentice-Hall, NJ, USA] that are difficult to capture and model by using the usual fixed scale (typically 20 ms) short-time spectral analysis used in hidden Markov models (HMMs), based on piecewise stationarity and state conditional independence assumptions of acoustic vectors. For example, vowels are typically quasi-stationary over 40–80 ms segments, while plosive typically require analysis below 20 ms segments. Thus, fixed scale analysis is clearly sub-optimal for “optimal” time–frequency resolution and modeling of different stationary phones found in the speech signal. In the present paper, we investigate the potential advantages of using variable size analysis windows towards improving state-of-the-art speech recognition systems. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, based on the likelihood that a segment was generated by the same AR process. This likelihood is estimated from the linear prediction (LP) residual error. Each of these quasi-stationary segments is then used as an analysis window from which spectral features are extracted. Such an approach thus results in a variable-scale time spectral analysis, adaptively estimating the largest possible analysis window size such that the signal remains quasi-stationary, thus the best temporal/frequency resolution tradeoff. The speech recognition experiments on the OGI Numbers95 database [Cole, R.A., Fanty, M., Lander, T., 1994. Telephone speech corpus at CSLU. In: Proc. of ICSLP, Yokohama, Japan.], show that the proposed variable-scale piecewise stationary spectral analysis based features indeed yield improved recognition accuracy in clean conditions, compared to features based on minimum cross entropy spectrum [Loughlin, P., Pitton, J., Hannaford, B., 1994. Approximating time–frequency density functions via optimal combinations of spectrograms, IEEE Signal Process. Lett. 1 (12)] as well as those based on fixed scale spectral analysis.

In this paper, we present a new keyword spotting technique. A critical issue in keyword spotting is the explicit modeling of the non-keyword portions. To date, most keyword spotters use a set of Hidden Markov Models (HMM) to represent the non-keyword portions. A widely used approach is to split the training data into keyword and non-keyword data. The keywords are represented by HMMs trained using the keyword speech, and the garbage models are trained using the non-keyword speech. The main disadvantage of this method is the task dependence. Another approach is to use a common set of acoustic models for both keywords and garbage models. However, this method faces a major problem. In a keyword spotter, the garbage models are usually connected to allow any sequence. Therefore, the keywords are also included in these sequences. When the same training data are used for keyword and garbage models, the garbage models also cover the keywords. In order to overcome these problems, we propose a new method for modeling the non-keyword intervals. In our method, the garbage models are phonemic HMMs trained using a speech corpus of a language other than—but acoustically similar to—the target language. In our work, the target language is Japanese and, due to the high similarity, English was chosen as the ‘garbage language’ for training the garbage models. Using English garbage models—instead of Japanese—our method achieves higher performance, compared with when Japanese garbage models are used. Moreover, parameter tuning (e.g., word insertion penalty) does not have a serious effect on the performance when English garbage models are used. Using clean telephone speech test data and a vocabulary of 100 keywords, we achieved a 7.9% equal error rate which is a very promising result. In this paper we also introduce results obtained using several sizes of vocabulary, and we investigate the selection of the most appropriate garbage model set. In addition to the Japanese keyword spotting system, we also introduce results of an English keyword spotter. By using Japanese garbage models—instead of English—we achieved significant improvement. Using telephone speech test data and a vocabulary of 25 keywords the achieved Figure of Merit (FOM) was 74.7% compared to 68.9% when English garbage models were used.

Telephone Speech Research Articles

Related Topics

Articles published on Telephone Speech

Discriminative cluster adaptive training

Analysis of the errors produced by the 2004 BBN speech recognition system in the DARPA EARS evaluations

Opinions on Cochlear Implant Use in Senior MED-EL Patients

Comparison of different mobile telephones in Combi40 + users

On variable-scale piecewise stationary spectral analysis of speech signals for ASR

Design, implementation and evaluation of the Czech realistic audio-visual speech synthesis

Noise robust bandwidth extension of telephone speech for mobile and landline communications

Minimum phone error training of precision matrix models

Interruption/Interaction/Collaboration: A Critical Appraisal of the Textual @traction Interactive Event

Corrections to “Automatic Transcription of Conversational Telephone Speech”

Random forests and the data sparseness problem in language modeling

Language identification using acoustic log-likelihoods of syllable-like units

Automatic transcription of conversational telephone speech

Techniques for artificial bandwidth extension of telephone speech

Unsupervised speaker indexing using generic models

Aural and automatic forensic speaker recognition in mismatched conditions

A study in machine learning from imbalanced data for sentence boundary detection in speech

Acoustic correlates of non-modal phonation in telephone speech

A novel approach for modeling non-keyword intervals in a keyword spotter exploiting acoustic similarities of languages

Discriminación verbal a través del teléfono en pacientes implantados con un Combi 40+

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Telephone Speech Research Articles

Related Topics

Articles published on Telephone Speech

Discriminative cluster adaptive training

Analysis of the errors produced by the 2004 BBN speech recognition system in the DARPA EARS evaluations

Opinions on Cochlear Implant Use in Senior MED-EL Patients

Comparison of different mobile telephones in Combi40 + users

On variable-scale piecewise stationary spectral analysis of speech signals for ASR

Design, implementation and evaluation of the Czech realistic audio-visual speech synthesis

Noise robust bandwidth extension of telephone speech for mobile and landline communications

Minimum phone error training of precision matrix models

Interruption/Interaction/Collaboration: A Critical Appraisal of the Textual @traction Interactive Event

Corrections to “Automatic Transcription of Conversational Telephone Speech”

Random forests and the data sparseness problem in language modeling

Language identification using acoustic log-likelihoods of syllable-like units

Automatic transcription of conversational telephone speech

Techniques for artificial bandwidth extension of telephone speech

Unsupervised speaker indexing using generic models

Aural and automatic forensic speaker recognition in mismatched conditions

A study in machine learning from imbalanced data for sentence boundary detection in speech

Acoustic correlates of non-modal phonation in telephone speech

A novel approach for modeling non-keyword intervals in a keyword spotter exploiting acoustic similarities of languages

Discriminación verbal a través del teléfono en pacientes implantados con un Combi 40+