Abstract

This paper proposes an automatic algorithm for estimating the first two subglottal resonances (SGRs), Sg1 and Sg2, from continuous speech of children, and applies it to automatic speaker normalization in mismatched, limited-data conditions. The proposed algorithm is based on the observation that Sg1 and Sg2 form phonological vowel feature boundaries, and is motivated by our recent SGR estimation algorithm for adults. The algorithm is trained and evaluated, respectively, on 25 and 9 children, aged between 7 and 18 years. The average RMS errors incurred in estimating Sg1 and Sg2 are 55 and 144 Hz, respectively. By applying the proposed algorithm to a connected-digits speech recognition task, it is shown that: 1) a linear frequency warping using Sg1 or Sg2 is comparable to or better than maximum likelihood-based vocal tract length normalization (ML-VTLN), 2) the performance of SGR-based frequency warping is less content dependent than that of ML-VTLN, and 3) SGR-based frequency warping can be integrated into ML-VTLN to yield a statistically significant improvement in performance.
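To make the normalization step concrete, the following is a minimal sketch of linear frequency warping driven by a subglottal resonance. It assumes the warp factor is the ratio of a reference speaker's SGR to the test speaker's estimated SGR, and that this factor rescales the frequency axis before feature extraction; the particular reference values and function names are illustrative, not from the paper.

```python
def sgr_warp_factor(sgr_speaker_hz: float, sgr_reference_hz: float) -> float:
    """Warp factor as the ratio of a reference SGR to the speaker's SGR.

    Both arguments are in Hz; sgr_speaker_hz would come from the paper's
    automatic Sg1/Sg2 estimator, and sgr_reference_hz from a chosen
    reference speaker (values here are hypothetical).
    """
    if sgr_speaker_hz <= 0:
        raise ValueError("SGR estimate must be positive")
    return sgr_reference_hz / sgr_speaker_hz


def warp_frequency(freq_hz: float, alpha: float) -> float:
    """Linearly rescale a frequency (e.g., a filterbank center) by alpha."""
    return alpha * freq_hz


# Example: a child's Sg1 estimated at 700 Hz, reference Sg1 at 600 Hz.
alpha = sgr_warp_factor(700.0, 600.0)
warped = warp_frequency(1000.0, alpha)  # 1000 Hz on the child's axis
```

In practice the warp would be applied to the filterbank used for acoustic features, which is how it can be combined with, or substituted for, the ML-VTLN warp search described in the abstract.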
