Abstract

A proven method for mitigating ASR performance degradation due to speaker differences is to perform speaker normalization of the acoustic features. More effective speaker normalization methods are needed that require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), in which normalization is performed on the fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN requires two separate warping stages, whereas the proposed BISN method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed on (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
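The abstract's central claim, that the perceptual warp and the speaker (VTLN) warp can be collapsed into one warp, follows from a property of the first-order all-pass bilinear transform (BLT): composing two BLT frequency warps yields another BLT warp. The sketch below illustrates this in Python; the function names and the specific warp values are illustrative, not from the paper.

```python
import numpy as np

def blt_warp(omega, alpha):
    """First-order all-pass bilinear transform (BLT) frequency warp.

    Maps normalized frequency omega (radians, 0..pi) through the phase
    response of a first-order all-pass with warp factor alpha (|alpha| < 1).
    alpha = 0 is the identity; alpha > 0 stretches low frequencies.
    """
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

def combined_alpha(alpha_perceptual, alpha_speaker):
    """Single warp factor equivalent to applying both BLT warps in sequence.

    Two first-order all-pass warps compose into one with parameter
    (a + b) / (1 + a * b); this is the mathematical basis for replacing
    the two-stage (perceptual + VTLN) warping with one built-in warp.
    """
    return (alpha_perceptual + alpha_speaker) / \
           (1.0 + alpha_perceptual * alpha_speaker)

# Example: warping first by a speaker factor, then by a perceptual factor,
# equals a single warp with the combined factor.
omega = 1.0
two_stage = blt_warp(blt_warp(omega, 0.1), 0.42)
one_stage = blt_warp(omega, combined_alpha(0.42, 0.1))
```

Because the composition is closed, the front end needs only one speaker-dependent table of warped frequencies per utterance, which is where the computational saving comes from.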

Highlights

  • Current speaker-independent automatic speech recognition (ASR) systems perform well in most real-world applications, but the performance gap between speaker-dependent and speaker-independent settings is still significant

  • Our approach in built-in speaker normalization (BISN) is fundamentally different in the sense that each best bilinear transform (BLT) warp factor is estimated within the vocal-tract length normalization (VTLN) framework proposed by Lee and Rose [6, 7]

  • The envelope is extracted via a low-order all-pole minimum variance distortionless response (MVDR) spectrum, which is shown to be superior to linear prediction (LP) based envelopes [17]


Summary

INTRODUCTION

Current speaker-independent automatic speech recognition (ASR) systems perform well in most real-world applications, but the performance gap between speaker-dependent and speaker-independent settings is still significant. Early normalization approaches were computationally cumbersome and required a substantial amount of speech from each speaker in order to estimate the best warp factor. Their basic motivation was to extract acoustic features with reduced speaker dependency. VTLN has been shown to be effective for a number of tasks, but the computational load of determining the best warp for each speaker, especially at recognition time, is not tractable. Lee and Rose proposed computationally more efficient variants of VTLN based on GMM modeling of each VTLN warp [6, 7]. Our approach in BISN is fundamentally different in the sense that each best BLT warp factor is estimated within the VTLN framework proposed by Lee and Rose [6, 7].
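The per-speaker warp estimation referred to above is, in the Lee-Rose framework, a maximum-likelihood grid search over candidate warp factors. A minimal sketch, assuming a hypothetical `log_likelihood` callable that stands in for scoring the warped features against the acoustic model, and a typical VTLN grid of roughly ±12% around the unwarped axis:

```python
import numpy as np

def estimate_warp_factor(log_likelihood, warps=None):
    """Maximum-likelihood grid search for a speaker's VTLN warp factor.

    Tries each candidate warp, scores the speaker's warped features
    against the acoustic model, and keeps the highest-scoring factor.
    `log_likelihood` is a hypothetical callable standing in for the
    recognizer's scoring pass over one candidate warp.
    """
    if warps is None:
        # Illustrative grid of candidate factors: 0.88, 0.90, ..., 1.12
        warps = np.arange(0.88, 1.13, 0.02)
    scores = [log_likelihood(w) for w in warps]
    return float(warps[int(np.argmax(scores))])
```

This search is exactly the per-speaker cost that BISN targets: instead of a separate VTLN pass, the chosen factor is folded into the single built-in warp applied inside the PMVDR front end.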

THE PMVDR ACOUSTIC FRONT END
Direct warping of FFT spectrum
Implementation of direct warping
Implementation of PMVDR
THE “MEANING” OF PERCEPTUAL WARPING
OFFLINE VTLN
Warping factor estimation
Model versus feature space search
EXPERIMENTAL FRAMEWORK
General system description
Experiments for CU-Move extended digits task
Experiments for the SPINE task
APPLICATION OF BISN IN A REAL-TIME SCENARIO
COMPUTATIONAL CONSIDERATIONS
Findings
CONCLUSIONS