Abstract

This paper presents the design of a mixture of Gaussian Mixture Models (GMMs) for Query-by-Example Spoken Term Detection (QbE-STD). Speech data exhibits acoustically similar broad phonetic structure. To capture this structure, we exploit additional information from broad phoneme classes (such as vowels, semi-vowels, nasals, fricatives, and plosives) when training the GMMs. The mixture of GMMs is tied to the GMMs of these broad phoneme classes, i.e., each GMM expresses the probability density function (pdf) of a broad phoneme category. The Expectation Maximization (EM) algorithm is used to obtain the GMM for each broad phoneme class. Thus, the mixture of GMMs represents the spoken query under broad phonetic constraints. These constraints restrict the posterior probability within the broad class, which results in a better posteriorgram design. The novelty of our work lies in the prior probability assignments (as weights of the mixture of GMMs) for better Gaussian posteriorgram design. The proposed simple yet effective posteriorgram outperforms the Gaussian posteriorgram because of the implicit constraints supplied by broad phonetic posteriors. The Maximum Term Weighted Value (MTWV) for the SWS 2013 dataset is improved by 0.052 and 0.051 w.r.t. the Gaussian posteriorgram for Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP), respectively. We found that the proposed mixture-of-GMMs approach gave consistently better performance than the Gaussian posteriorgram across various evaluation factors, such as different cepstral representations, the number of Gaussian components, the number of spoken examples per query, and the amount of labeled data used for broad phoneme posterior computation.
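As a rough illustration of the construction described above, the following sketch fits one GMM per broad phoneme class with EM and stacks the class-conditional component posteriors, weighted by class priors, into a single posteriorgram. This is a minimal sketch, assuming scikit-learn's GaussianMixture for EM training, per-class cepstral (e.g., MFCC or PLP) frame arrays as input, and relative frame counts as the class priors; the function names and the choice of prior are illustrative and not necessarily the paper's exact prior-assignment scheme.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical broad phoneme classes; each maps to an array of labeled
    # cepstral frames used to train that class's GMM with EM.
    BROAD_CLASSES = ["vowel", "semivowel", "nasal", "fricative", "plosive"]

    def train_class_gmms(frames_by_class, n_components=32):
        # Fit one GMM per broad phoneme class (sklearn's GaussianMixture runs EM).
        # Class priors are taken as relative frame counts -- an assumption,
        # standing in for the paper's prior probability assignments.
        total = sum(len(f) for f in frames_by_class.values())
        gmms, priors = {}, {}
        for cls, frames in frames_by_class.items():
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            gmm.fit(frames)
            gmms[cls] = gmm
            priors[cls] = len(frames) / total
        return gmms, priors

    def mixture_of_gmms_posteriorgram(features, gmms, priors):
        # Posteriorgram whose dimensions are the Gaussian components of all
        # class GMMs: P(class c, component k | frame x) is proportional to
        # prior_c * p(x | c) * P(k | x, c), so posteriors stay tied to classes.
        log_blocks = []
        for cls, gmm in gmms.items():
            log_lik = gmm.score_samples(features)        # log p(x | c), shape (T,)
            comp_post = gmm.predict_proba(features)      # P(k | x, c), shape (T, K)
            log_blocks.append(np.log(priors[cls]) + log_lik[:, None]
                              + np.log(comp_post + 1e-300))
        log_post = np.concatenate(log_blocks, axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)    # rows sum to 1

A query and a test utterance could each be converted to such a posteriorgram and compared frame by frame, for example with dynamic time warping, as is standard in QbE-STD.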
