Vocal Tract Length Normalization Research Articles

Recognizing children’s speech with automatic speech recognition (ASR) systems developed using adults’ speech is a very challenging task. As several earlier works report, recognition performance degrades severely in such tasks, mainly because of the gross mismatch in acoustic and linguistic attributes between the two groups of speakers. One identified source of mismatch is that the vocal organs of adult and child speakers differ significantly in dimension. Feature-space normalization techniques are noted to address the ill effects of those differences effectively; the two most commonly used approaches are vocal tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in average pitch between adult and child speakers. Addressing the ill effects introduced by these pitch differences is the primary focus of the presented study. To this end, we explore the feasibility of explicitly changing the pitch of children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, children’s speech is higher-pitched than adults’ speech. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged while that of the children’s test speech is reduced. This explicit pitch reduction yields a significant improvement in recognition performance. To conserve the critical spectral information and to avoid introducing perceptual artifacts, we exploit timescale modification techniques for explicit pitch mapping. Furthermore, we present two schemes to automatically determine the factor by which the pitch of the given test data should be varied. Automatically determining the compensation factor is critical since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on ASR systems trained on adult data and employing different acoustic modeling approaches, viz. Gaussian mixture modeling (GMM), subspace GMM and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study their effectiveness, another DNN-based ASR system is developed on a mix of speech data from adult and child speakers. Pitch reduction is observed to be effective even in this case.
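The pitch-reduction idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which timescale modification method was used, so plain overlap-add (OLA) is swapped in here for brevity, and the function names (`resample_linear`, `ola_time_stretch`, `lower_pitch`), the frame and hop sizes, and the factor `alpha` are all assumptions made for illustration. The sketch first resamples the signal so that both pitch and duration change, then applies timescale modification to bring the duration back.

```python
import numpy as np


def resample_linear(x, factor):
    """Stretch x in time by `factor` using linear interpolation.

    Played back at the original sampling rate, the result is `factor`
    times longer and its pitch is scaled by 1 / factor.
    """
    n_out = int(round(len(x) * factor))
    src_pos = np.arange(n_out) / factor
    return np.interp(src_pos, np.arange(len(x)), x)


def ola_time_stretch(x, rate, frame=1024, hop_out=256):
    """Plain overlap-add timescale modification (assumed stand-in).

    The output lasts roughly len(x) / rate samples. Plain OLA is crude;
    WSOLA or a phase vocoder would give fewer artifacts in practice.
    """
    hop_in = max(1, int(round(hop_out * rate)))
    if len(x) < frame:
        x = np.pad(x, (0, frame - len(x)))
    window = np.hanning(frame)
    n_frames = (len(x) - frame) // hop_in + 1
    out = np.zeros((n_frames - 1) * hop_out + frame)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        a, b = i * hop_in, i * hop_out
        out[b:b + frame] += x[a:a + frame] * window
        norm[b:b + frame] += window
    # Divide by the summed windows to undo the overlap gain.
    return out / np.maximum(norm, 1e-8)


def lower_pitch(x, alpha):
    """Scale pitch by alpha (0 < alpha <= 1) while roughly keeping duration."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    stretched = resample_linear(x, 1.0 / alpha)  # lower pitch, longer signal
    return ola_time_stretch(stretched, rate=1.0 / alpha)  # restore duration
```

For example, `lower_pitch(x, 0.8)` would map a 300 Hz tone to roughly 240 Hz at about the original length. How `alpha` should be chosen for a given test utterance is exactly what the two automatic schemes mentioned in the abstract address; the sketch leaves that choice to the caller.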

This study explores issues in automatic speech recognition (ASR) of children’s speech on acoustic models trained using adults’ speech. For acoustic modeling in ASR, the employed front-end features capture the characteristics of the vocal filter while smoothing out those of the source (excitation). Adults’ and children’s speech differ significantly because of large deviations in acoustic correlates such as pitch, formants and speaking rate. In children’s speech recognition on mismatched acoustic models, the recognition rates remain highly degraded despite the use of vocal tract length normalization (VTLN) to address formant mismatch. For commonly used mel-filterbank-based cepstral features, earlier studies have shown that the acoustic mismatch is exacerbated by insufficient smoothing of pitch harmonics for child speakers. To address this problem, an earlier work explored a structured low-rank projection of the test features as well as of the mean and covariance parameters of the acoustic models. In this paper, a low-latency adaptation scheme is presented for children’s mismatched ASR. The presented fast adaptation approach exploits the earlier reported low-rank projection technique in order to reduce the computational cost. In the proposed approach, developmental data from the children’s domain is partitioned into separate groups on the basis of the estimated VTLN warp factors. A set of adapted acoustic models is then created by combining the low-rank projection with the model-space adaptation technique for each of the warp factors. Given a children’s test utterance, an appropriate pre-adapted model mean supervector is first chosen based on the utterance’s estimated warp factor. The chosen supervector is then optimally scaled. Consequently, only two parameters need to be estimated: a warp factor and a model mean scaling factor. Even with such stringent constraints, the proposed adaptation technique results in a relative improvement of about 44% over the VTLN-included baseline.
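The two-parameter adaptation step described in this abstract (pick a pre-adapted mean supervector by warp factor, then scale it) can be sketched as follows. This is a toy illustration under strong assumptions, not the paper's method: the supervector is reduced to a single mean vector, the covariance is taken to be the identity, and the scale is the closed-form maximum-likelihood value under that single-Gaussian assumption. The function name `select_and_scale` and the dictionary `pre_adapted` are hypothetical.

```python
import numpy as np


def select_and_scale(features, warp, pre_adapted):
    """Choose a pre-adapted mean by warp factor, then scale it.

    features    : (T, D) array of test-utterance feature vectors
    warp        : VTLN warp factor estimated for the utterance
    pre_adapted : dict mapping warp factor -> mean vector of shape (D,)
                  (hypothetical stand-in for the mean supervectors)
    """
    warps = np.array(sorted(pre_adapted))
    # Parameter 1: the group whose warp factor is nearest the estimate.
    chosen = float(warps[np.argmin(np.abs(warps - warp))])
    mu = pre_adapted[chosen]
    # Parameter 2: scale beta minimizing sum_t ||o_t - beta * mu||^2,
    # i.e. the ML scale for one Gaussian with identity covariance
    # (an assumption; the abstract does not state the criterion).
    obs_mean = features.mean(axis=0)
    beta = float(obs_mean @ mu / (mu @ mu))
    return chosen, beta, beta * mu
```

In a real system the scale would presumably be estimated against the full acoustic model with state alignments; the point of the sketch is only that, once the pre-adapted models exist, test-time adaptation reduces to one lookup and one scalar estimate.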

Articles published on Vocal Tract Length Normalization

Assessment of pitch-adaptive front-end signal processing for children’s speech recognition

Explicit Pitch Mapping for Improved Children’s Speech Recognition

A Fast Adaptation Approach for Enhanced Automatic Recognition of Children’s Speech with Mismatched Acoustic Models

DNNs for unsupervised extraction of pseudo speaker-normalized features without explicit adaptation data

Evaluation of the Vocal Tract Length Normalization Based Classifiers for Speaker Verification

Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children

Subject-independent acoustic-to-articulatory mapping of fricative sounds by using vocal tract length normalization

Investigation of Speaker Group-Dependent Modelling for Recognition of Affective States from Speech

Frequency Warping for Speaker Adaptation in HMM-based Speech Synthesis

Combining Vocal Tract Length Normalization With Hierarchical Linear Transformations

Fuzzy Phoneme Classification Using Multi-speaker Vocal Tract Length Normalization

Speaker normalization in noisy environments using subglottal resonances

On Acoustic Emotion Recognition: Compensating for Covariate Shift

Rapid speaker adaptation in latent speaker space with non-negative matrix factorization

Estimated relative vocal tract lengths from vowel spectra based on fundamental frequency adaptive analyses and their relations to relevant physical data of speakers

Modified Mel Frequency Cepstrum for Korean Children’s Speech Recognition

On Reducing Harmonic and Sampling Distortion in Vocal Tract Length Normalization

Robust Speech Parameters for Emotional Speech Recognition
