Abstract

In recent years, acoustic word embeddings (AWEs) have attracted significant interest in the research community, in particular for Query-by-Example Spoken Term Detection (QbE-STD) search and related word discrimination tasks. It has been shown that AWEs learned for word or phone classification in one or several languages can outperform approaches that use dynamic time warping (DTW). In this paper, a new method of learning AWEs within the DTW framework is proposed. It employs a multitask triplet neural network to generate the AWEs: the triplet network learns acoustic representations of words by comparing DTW distances. In addition, a multitask objective is proposed that combines a conventional word classification component with a triplet loss component, where the triplet loss applies the DTW distance to the word discrimination task. The multitask objective ensures that the embeddings can be used with DTW directly. Experimental validation shows that the proposed approach is well suited, but not restricted, to QbE-STD search. A comparison with several baseline methods shows that the new method significantly improves results on the word discrimination task. An evaluation of word clustering in the learned embedding space is also presented.
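As a rough illustration of the objective described above, the following is a minimal PyTorch sketch that combines a word classification cross-entropy term with a triplet term under a differentiable distance. All names here (`encoder`, `classifier`, `distance_fn`, `alpha`, `margin`) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def multitask_loss(encoder, classifier, anchor, positive, negative,
                   labels, distance_fn, alpha=0.5, margin=1.0):
    """Multitask objective: word classification plus triplet loss."""
    emb_a = encoder(anchor)    # AWE of the anchor word segment
    emb_p = encoder(positive)  # another utterance of the same word
    emb_n = encoder(negative)  # an utterance of a different word

    # Classification component: predict the anchor's word label.
    ce_loss = F.cross_entropy(classifier(emb_a), labels)

    # Triplet component: same-word pairs should be closer than
    # different-word pairs under the chosen distance, which can be
    # any differentiable measure (e.g. Euclidean or Soft-DTW).
    d_pos = distance_fn(emb_a, emb_p)
    d_neg = distance_fn(emb_a, emb_n)
    triplet_loss = torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

    return alpha * ce_loss + (1.0 - alpha) * triplet_loss
```

Because the triplet component only requires `distance_fn` to be differentiable, the same training loop serves both a plain Euclidean setup and a Soft-DTW setup over frame-level embedding sequences.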

Highlights

  • The Dynamic Time Warping (DTW) algorithm, introduced more than two decades ago [1], finds the optimal alignment between points of two time-series.

  • Since the proposed approach imposes no constraints on the choice of distance measure beyond differentiability, the Euclidean distance can be substituted with the Soft-DTW distance, as we show (see the sketch after this list).

  • Experimental results show that the Soft-DTW triplet network trained with weak supervision achieves an Average Precision (AP) of 0.737 on the Speech Commands dataset, compared with an AP of 0.682 for the Siamese Long Short-Term Memory (LSTM) baseline adopted from [10].
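Below is a minimal NumPy sketch of the Soft-DTW distance (Cuturi and Blondel, 2017) referenced in the highlights: the hard minimum in the DTW recursion is replaced by a smooth soft-minimum, which makes the distance differentiable. The variable names and the `gamma` default are illustrative; as `gamma` approaches 0, classic DTW is recovered.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma)))."""
    z = -np.asarray(values) / gamma
    z_max = z.max()  # stabilize the log-sum-exp
    return -gamma * (z_max + np.log(np.exp(z - z_max).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW between feature sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)  # squared Euclidean
            r[i, j] = cost + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]
```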



Introduction

The Dynamic Time Warping (DTW) algorithm, introduced more than two decades ago [1], finds the optimal alignment between points of two time-series. By nonlinearly mapping samples of one time-series onto another, the method achieves an effective alignment despite possible local temporal or phase distortions. In speech recognition and classification tasks, DTW is used along with Mel-frequency cepstral coefficients (MFCCs) as a feature representation of acoustic time-series. It has been shown that applying MFCCs directly in DTW often becomes a limiting factor for overall system performance, since the same speech units can be pronounced in many different ways. These differences are a natural result of the high variability in the physical anatomy of the human vocal tract and of factors such as the speaker's sex, age, accent, cognitive load, or emotional state.
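For concreteness, here is a minimal sketch of classic DTW over MFCC frames, the baseline setting described above. MFCC extraction via librosa and the wav file paths are assumptions for illustration; any frame-level feature sequence would do.

```python
import numpy as np
import librosa

def dtw_distance(x, y):
    """Classic DTW between feature sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # Nonlinear alignment: each step advances one series or both.
            acc[i, j] = cost + min(acc[i - 1, j],
                                   acc[i, j - 1],
                                   acc[i - 1, j - 1])
    return acc[n, m]

# Usage: compare two utterances via their MFCC sequences
# ("utt1.wav" / "utt2.wav" are placeholder paths).
y1, sr1 = librosa.load("utt1.wav", sr=16000)
y2, sr2 = librosa.load("utt2.wav", sr=16000)
mfcc1 = librosa.feature.mfcc(y=y1, sr=sr1, n_mfcc=13).T  # (frames, 13)
mfcc2 = librosa.feature.mfcc(y=y2, sr=sr2, n_mfcc=13).T
print(dtw_distance(mfcc1, mfcc2))
```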

