Query-by-example Spoken Term Detection Research Articles

The speech spectrum changes with the length of a speaker's vocal tract, because speech formants are inversely related to the vocal tract length (VTL). The process of compensating for this spectral variation is known as Vocal Tract Length Normalization (VTLN). VTLN is a very important speaker normalization technique for speech recognition and related tasks. In this paper, we use the Gaussian Posteriorgram (GP) of VTL-warped spectral features for a Query-by-Example Spoken Term Detection (QbE-STD) task. We present a Gaussian Mixture Model (GMM) framework for VTLN warping factor estimation that, in particular, does not require phoneme-level transcription. We observed the correlation between the VTLN warping factor estimates obtained via a supervised HMM-based approach and an unsupervised GMM-based approach. In addition, phoneme recognition and speaker de-identification tasks were conducted using the GMM-based VTLN warping factor estimates. For QbE-STD, we considered three spectral features, namely, Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and MFCC-TMP (which uses the Teager Energy Operator (TEO) to implicitly exploit magnitude and phase information in the MFCC framework). Linear frequency scaling variations for the VTLN warping factor are incorporated into these three cepstral representations for the QbE-STD task. The VTL-warped Gaussian posteriorgram improved the Maximum Term Weighted Value by 0.021 (i.e., 2.1%) and 0.015 (i.e., 1.5%) for the MFCC and PLP feature sets, respectively, on the evaluation set of the MediaEval SWS 2013 corpus. The better performance is primarily due to VTLN warping factor estimation using the unsupervised GMM framework. Finally, we show the effectiveness of the proposed VTL-warped GP when detections are rescored using various detection sources, such as depth of detection valley, Self-Similarity Matrix, Pseudo Relevance Feedback, and weighted mean features.
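As a rough illustration of what a Gaussian posteriorgram is, the sketch below computes, for each feature frame, the posterior probability of every GMM component. It is a minimal sketch and not the authors' implementation: it assumes a diagonal-covariance GMM whose parameters are already trained, and the function names and toy values are hypothetical. VTL warping of the features is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gaussian_posteriorgram(frames, weights, means, variances):
    """Return one posterior vector per frame: P(component k | frame)."""
    posteriorgram = []
    for x in frames:
        # Log of (mixture weight * component likelihood) for each component.
        log_joint = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(log_joint)  # subtract the max for numerical stability
        unnorm = [math.exp(lj - peak) for lj in log_joint]
        total = sum(unnorm)
        posteriorgram.append([u / total for u in unnorm])
    return posteriorgram

# Toy 2-component GMM over 2-D features (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [2.9, 3.1], [1.5, 1.5]]
gp = gaussian_posteriorgram(frames, weights, means, variances)
```

Each row of `gp` sums to one, so a query and a test utterance can be compared frame by frame as sequences of probability vectors.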


The Query-by-Example approach to spoken content retrieval has gained much attention because it is feasible in the absence of speech recognition and applicable in multilingual matching scenarios. This approach to retrieving spoken content is referred to as Query-by-Example Spoken Term Detection (QbE-STD). State-of-the-art QbE-STD systems match the frame sequences of the query and the test utterance via the Dynamic Time Warping (DTW) algorithm. In realistic scenarios, there is a need to retrieve a query that does not appear exactly in the spoken document; the instance that does appear might have a different suffix, prefix, or word order. Because the DTW algorithm aligns the two sequences monotonically, it is not suitable for partial matching between the frame sequences of the query and the test utterance. In this paper, we propose a novel partial matching approach between the spoken query and the utterance using a modified DTW algorithm in which multiple warping paths are constructed for each query and test utterance pair. Next, we address the search complexity of DTW and suggest two approaches, namely, a feature reduction approach and a Bag-of-Acoustic-Words (BoAW) model. In the feature reduction approach, the number of feature vectors is reduced by averaging consecutive frames within phonetic boundaries. Fewer feature vectors require fewer comparisons, which speeds up the DTW search. The search computation time is reduced by 46–49%, with a slight degradation in performance compared to the case without feature reduction. In the BoAW model, we construct term frequency-inverse document frequency (tf-idf) vectors at the segment level to retrieve audio documents. The proposed segment-level BoAW model matches a test utterance with a query using tf-idf vectors, and the resulting scores are used to rank the test utterances. The BoAW model gave a recall of more than 80% at 70% top retrieval. To re-score the detections, we further employ DTW search or modified DTW search to retrieve the spoken query from the utterances selected by the BoAW model. QbE-STD experiments are conducted on different international benchmarks, namely, MediaEval Spoken Web Search (SWS) 2013 and MediaEval Query-by-Example Search on Speech (QUESST) 2014.
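The frame-level matching and the feature reduction idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's modified DTW: it uses classic whole-sequence DTW with a Euclidean frame distance, and the segment boundaries passed to the hypothetical `average_segments` helper are assumed to be given (the abstract takes them from phonetic boundaries). All names and toy values are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(query, test, dist=euclidean):
    """Classic DTW: cumulative cost of the best monotonic alignment."""
    n, m = len(query), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], test[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def average_segments(frames, boundaries):
    """Replace each segment [start, end) by the mean of its frames.

    `boundaries` must be strictly increasing frame indices.
    """
    reduced = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment = frames[start:end]
        dim = len(segment[0])
        reduced.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return reduced

# Toy 1-D "features"; the boundaries are assumed, not detected.
query = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
test = [[0.0], [1.0], [1.0], [1.0], [2.0]]
full = dtw_distance(query, test)
reduced_query = average_segments(query, [0, 2, 4, 6])
reduced_test = average_segments(test, [0, 1, 4, 5])
reduced = dtw_distance(reduced_query, reduced_test)
```

With the toy data, the full comparison fills a 6 x 5 cost matrix while the reduced one fills only 3 x 3, which is the source of the speed-up: DTW cost grows with the product of the two sequence lengths.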


Articles published on Query-by-example Spoken Term Detection

Neural Network Based End-to-End Query by Example Spoken Term Detection

Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks

Vocal Tract Length Normalization using a Gaussian mixture model framework for query-by-example spoken term detection

Sparse Subspace Modeling for Query by Example Spoken Term Detection

Design of mixture of GMMs for Query-by-Example Spoken Term Detection

Unsupervised Discovery of Structured Acoustic Tokens With Applications to Spoken Term Detection

Multitask Feature Learning for Low-Resource Query-by-Example Spoken Term Detection

Partial matching and search space reduction for QbE-STD

Query-by-Example Spoken Term Detection using Frequency Domain Linear Prediction and Non-Segmental Dynamic Time Warping

Comparison of methods for language-dependent and language-independent query-by-example spoken term detection
