Abstract

Prosodic cues such as word prominence play a fundamental role in human communication, e.g., in marking important information. Since different speakers use a wide variety of features to express prominence, there is a large performance gap between speaker-dependently and speaker-independently trained models. To cope with these variations without training a new speaker-dependent model, speaker adaptation techniques from speech recognition, such as feature-space Maximum Likelihood Linear Regression (fMLLR), have proven very useful. These methods were developed for GMM-HMM classifiers under the assumption that the data can be well modeled by a mixture of a few Gaussian distributions. In many cases, however, this assumption is too restrictive; in particular, a discriminative classifier such as an SVM often yields far superior results to a GMM. We therefore propose a new adaptation method that adapts the data to the radial basis function kernel of the SVM. To avoid overfitting, we apply two regularization terms: the first is based on fMLLR, and the second is an L1 regularization that enforces a sparse transformation matrix. We analyze the method in the context of speaker adaptation for word prominence detection with varying amounts of adaptation data and different weights of the regularization terms, and show that our novel method clearly outperforms fMLLR-GMM- and fMLLR-SVM-based adaptation.
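The core idea can be sketched as follows: a linear feature transform W is applied to a new speaker's data before it enters the RBF kernel of an already-trained SVM, and W is fitted on a small amount of adaptation data under two regularizers. This is a minimal illustrative sketch, not the paper's actual formulation; in particular, the fMLLR-based regularizer is approximated here as a proximity term to an fMLLR-estimated transform `W_fmllr`, and all function and parameter names (`adaptation_objective`, `lam_fmllr`, `lam_l1`) are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian RBF kernel matrix between the rows of X and the rows of Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def adaptation_objective(W, X_adapt, y_adapt, alpha, sv, y_sv, b,
                         W_fmllr, lam_fmllr, lam_l1, gamma=0.5):
    """Hinge loss of a fixed RBF-SVM on linearly transformed adaptation
    data, plus two regularizers on the transform W (illustrative only)."""
    Z = X_adapt @ W.T                                  # adapted features
    # decision values of the pre-trained SVM (alpha, sv, y_sv, b are fixed)
    f = rbf_kernel(Z, sv, gamma) @ (alpha * y_sv) + b
    hinge = np.maximum(0.0, 1.0 - y_adapt * f).mean()  # SVM data term
    # regularizer 1: stay close to an fMLLR-style transform (assumed proxy
    # for the paper's fMLLR-based term)
    r_fmllr = np.sum((W - W_fmllr) ** 2)
    # regularizer 2: L1 norm, encouraging a sparse transformation matrix
    r_l1 = np.abs(W).sum()
    return hinge + lam_fmllr * r_fmllr + lam_l1 * r_l1
```

With this objective, W could be optimized by any generic (sub)gradient method; the weights `lam_fmllr` and `lam_l1` correspond to the regularization weights whose influence the abstract says is analyzed.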
