Explicit Pitch Mapping for Improved Children’s Speech Recognition

Hemant Kumar Kathania, A B Samaddar, Waquar Ahmad, S Shahnawazuddin

doi:10.1007/s00034-017-0652-0

Abstract

Recognizing children’s speech with automatic speech recognition (ASR) systems developed using adults’ speech is a very challenging task. Several earlier works report severely degraded recognition performance in such ASR tasks. This is mainly due to the gross mismatch in the acoustic and linguistic attributes of the two groups of speakers. One of the identified sources of mismatch is that the vocal organs of adult and child speakers have significantly different dimensions. Feature-space normalization techniques are noted to effectively address the ill effects arising from those differences. The two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in average pitch between adult and child speakers. Addressing the ill effects introduced by these pitch differences is the primary focus of the presented study. In this regard, we have explored the feasibility of explicitly changing the pitch of children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, speech data from children is high-pitched in comparison with that from adults. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged, while that of the children’s test speech is reduced. This explicit reduction of pitch yields a significant improvement in recognition performance. To conserve the critical spectral information and to avoid introducing perceptual artifacts, we have exploited timescale modification techniques for explicit pitch mapping. Furthermore, we also present two schemes to automatically determine the factor by which the pitch of the given test data should be varied. Automatically determining the compensation factor is critical since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on ASR systems trained on adult data and employing different acoustic modeling approaches, viz. Gaussian mixture modeling (GMM), subspace GMM and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study the effectiveness of the proposed approaches, another DNN-based ASR system is developed on a mix of speech data from adult as well as child speakers. The use of pitch reduction is observed to be effective even in this case.
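The abstract does not spell out the two schemes used to choose the compensation factor, so the following is only an illustrative sketch of one conceivable heuristic, not the authors' method. It estimates a typical pitch for an utterance with a simple autocorrelation tracker and returns the ratio of an assumed adult reference pitch to that estimate, leaving adult-like (low-pitched) input unchanged. The function names `estimate_f0`, `typical_pitch` and `compensation_factor`, the 150 Hz reference, the voicing threshold and the lower bound on the factor are all assumptions introduced here for illustration.

```python
import numpy as np


def estimate_f0(frame, sr, fmin=60.0, fmax=500.0):
    """Autocorrelation F0 estimate in Hz for one frame, or None if it looks unvoiced."""
    frame = frame - np.mean(frame)
    if np.dot(frame, frame) < 1e-8:
        return None  # silent frame
    # Keep only non-negative lags of the autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(frame) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    # Assumed voicing check: the peak must be a sizeable fraction of the energy.
    if ac[lag] < 0.3 * ac[0]:
        return None
    return sr / lag


def typical_pitch(signal, sr, frame_ms=40, hop_ms=10):
    """Median F0 over voiced frames, or None if no frame is judged voiced."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    f0s = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        f0 = estimate_f0(signal[start:start + frame_len], sr)
        if f0 is not None:
            f0s.append(f0)
    return float(np.median(f0s)) if f0s else None


def compensation_factor(signal, sr, adult_ref_f0=150.0, min_factor=0.4):
    """Hypothetical rule: scale pitch down towards an assumed adult reference.

    Returns 1.0 (no change) when the utterance is already at or below the
    reference, so adult speakers pass through unmodified.
    """
    f0 = typical_pitch(signal, sr)
    if f0 is None or f0 <= adult_ref_f0:
        return 1.0
    return max(adult_ref_f0 / f0, min_factor)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(int(0.5 * sr)) / sr
    child_like = np.sin(2 * np.pi * 300.0 * t)  # synthetic high-pitched tone
    adult_like = np.sin(2 * np.pi * 120.0 * t)  # synthetic low-pitched tone
    print("child-like factor:", compensation_factor(child_like, sr))
    print("adult-like factor:", compensation_factor(adult_like, sr))
```

The returned factor would then drive whatever pitch-modification front end is in use; the timescale-modification step itself is not sketched here. Pure tones stand in for speech in the demo only to keep the example self-contained.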


Journal: Circuits, Systems, and Signal Processing
Publication Date: Sep 11, 2017
Citations: 11

Similar Papers

Developing children’s speech recognition system for low resource Punjabi language
Virender Kadyan ... Amitoj Singh
Applied Acoustics | VOL. 178
22 Mar 2021

Creating speaker independent ASR system through prosody modification based data augmentation
S Shahnawazuddin ... B Tarun Sai
Pattern Recognition Letters | VOL. 131
30 Dec 2020

Developing speaker independent ASR system using limited data through prosody modification based on fuzzy classification of spectral bins
S Shahnawazuddin ... Hemant K Kathania
Digital Signal Processing | VOL. 93
11 Jul 2019

FMLLR Speaker Normalization With i-Vector: In Pseudo-FMLLR and Distillation Framework
Neethu Mariam Joy ... Srinivasan Umesh
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 26
01 Apr 2018
