Abstract
The ensemble speaker and speaking environment modeling (ESSEM) framework was designed to provide online optimization for enhancing workable systems under real-world conditions. In the ESSEM framework, ensemble models are built in the offline phase, each characterizing a specific environment based on local statistics prepared from that condition. In the online phase, a mapping function is computed from the incoming testing data to perform model adaptation. Previous studies used linear combination (LC) and linear combination with a correction bias (LCB) as simple mapping functions that apply only a single weighting coefficient to each model. To make better use of the ensemble models, this study presents a generalized affine transform group (ATG) mapping function for the ESSEM framework. Although ATG characterizes unknown testing conditions more precisely by using a larger number of parameters, over-fitting can occur when the available adaptation data is especially limited. This study handles over-fitting with three optimization processes: the maximum a posteriori (MAP) criterion, model selection (MS), and cohort selection (CS). Experimental results showed that, with ATG and the three optimization processes, the ESSEM framework achieved consistent performance improvements under unsupervised model adaptation using only one utterance.
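The difference between the three mapping functions can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the forms below are assumptions: each ensemble model is reduced to a single mean vector, LC is a weighted sum, LCB adds one bias vector, and ATG replaces each scalar weight with a full matrix. All function names are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only: the exact ESSEM formulation is not given in the
# abstract. Each ensemble model is assumed to be one mean vector.

def lc(models, weights):
    """Linear combination: one scalar weight per ensemble model."""
    dim = len(models[0])
    return [sum(w * m[d] for w, m in zip(weights, models)) for d in range(dim)]

def lcb(models, weights, bias):
    """Linear combination plus a single correction bias vector."""
    return [y + b for y, b in zip(lc(models, weights), bias)]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def atg(models, transforms, bias):
    """Assumed ATG form: one full matrix per ensemble model, plus a bias."""
    out = list(bias)
    for matrix, model in zip(transforms, models):
        out = [o + t for o, t in zip(out, matvec(matrix, model))]
    return out

def scaled_identity(weight, dim):
    """Matrix weight * I, used to show that ATG contains LCB as a special case."""
    return [[weight if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def parameter_counts(num_models, dim):
    """Free parameters of each mapping under the assumptions above."""
    return {
        "LC": num_models,
        "LCB": num_models + dim,
        "ATG": num_models * dim * dim + dim,
    }

models = [[1.0, 2.0], [3.0, 4.0]]   # two toy ensemble mean vectors
weights = [0.25, 0.75]
bias = [0.5, -0.5]

adapted_lc = lc(models, weights)
adapted_lcb = lcb(models, weights, bias)
adapted_atg = atg(models, [scaled_identity(w, 2) for w in weights], bias)
counts = parameter_counts(num_models=2, dim=2)
```

Under these assumptions, restricting every ATG matrix to a scaled identity recovers LCB, and setting the bias to zero recovers LC. The parameter count of the assumed ATG form grows with the square of the feature dimension, which is consistent with the over-fitting concern when only one utterance is available for adaptation.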