Abstract

This article describes a Vocal Tract Length Normalization (VTLN) procedure through frequency warping based on pitch estimates. The procedure aims to reduce the inter-speaker variability of speech signals in order to obtain a robust automatic speech recognition system. Two additional methods are also described: one for reducing environment variability and another for compensating for coarticulation effects in connected word pronunciation. Environment variability is compensated by explicitly modeling some frequent noise phenomena. Coarticulation compensation reduces speech signal variability by modeling events that result from coarticulation between adjacent models. Inter-speaker variability is removed by a traditional speaker normalization method, which consists of expanding or compressing the Mel filterbank bandwidths in order to normalize the Vocal Tract Length (VTL) of each speaker. Most existing methods for VTL estimation are based on formant estimation, but the difficulty of formant estimation is a known performance limitation. The proposed method overcomes this problem because it estimates the warping factor from pitch. The recognition results, obtained for a telephone digit recognition task (with phones and subwords as units), show that this procedure yields improvements similar to those obtained with traditional methods based on formant estimates, and outperforms them in some situations.
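The idea of expanding or compressing the Mel filterbank by a pitch-derived warping factor can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the linear pitch-to-warp mapping in `pitch_to_warp`, its reference pitch, slope and clipping range, the band limits, and the function names are all hypothetical. Only the Mel scale conversion is the commonly used formula.

```python
import math


def hz_to_mel(f_hz):
    """Common Mel-scale conversion (O'Shaughnessy form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def pitch_to_warp(pitch_hz, ref_pitch_hz=150.0, slope=0.2, lo=0.88, hi=1.12):
    """HYPOTHETICAL mapping from a pitch estimate to a warping factor.

    Assumption: the factor grows linearly with the relative deviation of
    the speaker's pitch from a reference pitch and is clipped to [lo, hi].
    The article's actual mapping is not given in this section.
    """
    alpha = 1.0 + slope * (pitch_hz - ref_pitch_hz) / ref_pitch_hz
    return max(lo, min(hi, alpha))


def warped_filter_edges(n_filters, f_min, f_max, alpha, nyquist=4000.0):
    """Edge frequencies (Hz) of a Mel filterbank scaled by alpha.

    The n_filters + 2 edges are equally spaced on the Mel scale between
    f_min and f_max, then multiplied by alpha: alpha > 1 expands the
    filterbank, alpha < 1 compresses it.
    """
    if alpha * f_max > nyquist:
        raise ValueError("warped filterbank exceeds the Nyquist frequency")
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_hi - mel_lo) / (n_filters + 1)
    edges = [mel_to_hz(mel_lo + i * step) for i in range(n_filters + 2)]
    return [alpha * f for f in edges]


if __name__ == "__main__":
    # Assumed telephone-band setup: 8 kHz sampling, 100-3400 Hz analysis band.
    alpha = pitch_to_warp(210.0)
    edges = warped_filter_edges(20, 100.0, 3400.0, alpha)
    print(f"alpha = {alpha:.3f}, last edge = {edges[-1]:.1f} Hz")
```

In this sketch a higher-pitched speaker gets a factor above 1 and therefore a stretched filterbank; whether the article uses this direction, a linear scaling, or a piecewise warp is not stated here.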

Highlights

  • The task of efficiently recognizing spoken credit card numbers, telephone numbers, or any other identification number is extremely important and requires an almost ideal recognition rate

  • Despite the good performance of currently available Automatic Speech Recognition (ASR) systems, the robustness that would allow a system to operate with an unlimited vocabulary has not yet been fully achieved when speaker independence and environment changes are taken into account

  • This work seeks to contribute to solving the central problem of ASR system robustness. This is achieved by modulating some variability agents present in speech signals, namely inter-speaker variability, environment variability and the variability imposed by an overwhelming number of coar­


Summary

INTRODUCTION

The task of efficiently recognizing spoken credit card numbers, telephone numbers, or any other identification number is extremely important and requires an almost ideal recognition rate. This work seeks to contribute to solving the central problem of ASR system robustness. This is achieved by modulating some variability agents present in speech signals, namely inter-speaker variability, environment variability and the variability imposed by an overwhelming number of coar­. Speaker normalization aims to reduce the speech variability resulting from differences between speakers, the so-called inter-speaker variability. These differences are essentially related to vocal and nasal tract shape and length, vocal cord physiology, and gender and age. This work investigates the advantage of separating the models by speaker gender.

SPEAKER VARIABILITY AGENTS
SPEECH DATA
RECOGNITION UNITS
COARTICULATION MODELING
GENDER DEPENDENT SYSTEM
PITCH BASED FREQUENCY WARPING
PITCH BASED FREQUENCY WARPING EXPERIMENTS
Findings
CONCLUSIONS
