Abstract
In this paper, we propose using a deep neural network (DNN) as a regression model to estimate speaker-normalized features from un-normalized features. We consider three types of speaker-specific feature normalization, viz., feature-space maximum likelihood linear regression (FMLLR), vocal tract length normalization (VTLN), and a combination of both. The un-normalized features considered are log filterbank features, Mel frequency cepstral coefficients (MFCC), and linear discriminant analysis (LDA) features. The DNN is trained on pairs of un-normalized input features and the corresponding speaker-normalized target features, and is optimized to minimize the mean square error between its output and the target speaker-normalized features. At test time, un-normalized features are passed through the trained DNN to obtain pseudo speaker-normalized features without any supervision, adaptation data, or first-pass decode. Because the pseudo speaker-normalized features are generated frame by frame, the proposed method, unlike FMLLR, VTLN, or i-vectors, requires no explicit adaptation data and is therefore suitable for scenarios where very little adaptation data is available. The proposed approach provides significant improvements over conventional speaker-normalization techniques when normalization is performed at the utterance level; experiments on TIMIT and on both a 33-hour subset and the full 300 hours of the Switchboard corpus support this claim. With a large amount of training data, the proposed pseudo speaker-normalized features outperform conventional speaker-normalized features in the utterance-wise normalization scenario and give consistent marginal improvements over un-normalized features.
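The core idea can be sketched as follows. This is a minimal, illustrative implementation, not the paper's actual system: the network depth, layer width, feature dimension, learning rate, and the synthetic (un-normalized, normalized) feature pairs are all assumptions made for demonstration, with a single hidden layer standing in for the paper's DNN.

```python
import numpy as np

# Minimal sketch of DNN-based regression from un-normalized features
# to speaker-normalized features, trained with a mean-square-error loss.
# All sizes and data here are illustrative assumptions.
rng = np.random.default_rng(0)
feat_dim = 40      # e.g. log filterbank dimension (assumption)
hidden_dim = 64    # hidden layer width (assumption)

# Toy data standing in for (un-normalized, speaker-normalized) pairs;
# a real system would use FMLLR/VTLN features as targets.
X = rng.standard_normal((500, feat_dim))           # un-normalized frames
A = rng.standard_normal((feat_dim, feat_dim)) * 0.1
T = X @ A                                          # stand-in normalized targets

# Parameters of a one-hidden-layer regression network.
W1 = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, feat_dim)) * 0.1
b2 = np.zeros(feat_dim)

lr = 0.01
losses = []
for epoch in range(200):
    H = np.tanh(X @ W1 + b1)        # hidden activations
    Y = H @ W2 + b2                 # predicted pseudo-normalized features
    err = Y - T
    losses.append(np.mean(err ** 2))
    # Backpropagate the MSE loss (mean over frames).
    dY = 2 * err / X.shape[0]
    dW2, db2 = H.T @ dY, dY.sum(axis=0)
    dH = (dY @ W2.T) * (1 - H ** 2)  # tanh derivative
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

def pseudo_normalize(frame):
    """Map a single un-normalized frame to pseudo speaker-normalized
    features, with no adaptation data or first-pass decode."""
    return np.tanh(frame @ W1 + b1) @ W2 + b2
```

The frame-by-frame mapping in `pseudo_normalize` is what removes the need for adaptation data at test time: each frame is transformed independently of the rest of the utterance.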