Acoustic Word Embeddings Research Articles

Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings on an arbitrary language. Here we explore multilingual transfer: we train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a word discrimination task on six target languages, all of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision. When using only a few training languages, the multilingual CAE-RNN performs better, but with more training languages the other multilingual models perform similarly. Using more training languages is generally beneficial, but improvements are marginal on some languages. We present probing experiments which show that the CAE-RNN encodes more phonetic, word duration, language identity and speaker information than the other multilingual models.
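As an illustration of the word discrimination evaluation mentioned in this abstract, the following is a minimal sketch of how average precision could be computed over all pairs of embedded word segments. It assumes cosine distance between embeddings and a simple pairwise same/different protocol; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_different_ap(embeddings, labels):
    """Average precision for ranking same-word pairs above different-word pairs.

    Assumption: every pair of segments is scored by cosine distance, and
    pairs are ranked from smallest to largest distance.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = cosine_distance(embeddings[i], embeddings[j])
        pairs.append((dist, labels[i] == labels[j]))
    pairs.sort(key=lambda p: p[0])

    hits = 0
    precision_sum = 0.0
    for rank, (_, is_same) in enumerate(pairs, start=1):
        if is_same:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


# Toy example: four 2-dimensional "embeddings" of two word types.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["cat", "cat", "dog", "dog"]
print(same_different_ap(toy_embeddings, toy_labels))
```

In this toy case the two same-word pairs are also the two closest pairs, so the ranking is perfect. With real embeddings the score falls below 1 whenever a different-word pair is closer than a same-word pair.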

Acoustic word embeddings (AWEs) have become popular in low-resource query-by-example (QbE) speech search. They use vector distances to locate a spoken query in the search content, which requires much less computation than conventional dynamic time warping (DTW)-based approaches. AWE networks are usually trained on variable-length isolated spoken words, but they are applied to fixed-length speech segments obtained by shifting an analysis window over the search content. This creates a mismatch between how AWEs are learned and how they are applied to search content. To mitigate this mismatch, we propose including temporal context information in the spoken word pairs used to learn recurrent neural AWEs. More specifically, the spoken word pairs are represented by multilingual bottleneck features (BNFs) and padded with the neighboring frames of the target spoken words to form fixed-length speech segment pairs. A deep bidirectional long short-term memory (BLSTM) network is then trained with a triplet loss on these speech segment pairs. Recurrent neural AWEs are obtained by concatenating the BLSTM backward and forward outputs. During the QbE speech search stage, both the spoken query and the search content are converted into recurrent neural AWEs, and cosine distances between them are measured to find the spoken query. The experiments show that using temporal context is essential to alleviate the mismatch. The proposed recurrent neural AWEs trained with temporal context outperform the previous state-of-the-art features, with a 12.5% relative improvement in mean average precision (MAP) on QbE speech search.
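The search procedure and training objective described in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `mean_pool_embed` is a hypothetical stand-in for the trained BLSTM embedder, the triplet loss is assumed to be a hinge loss on cosine distances, and the margin, window length and shift values are arbitrary.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.4):
    """Hinge triplet loss (assumed form; the margin value is arbitrary).

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`.
    """
    return max(0.0, margin
               + cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative))


def mean_pool_embed(frames):
    """Hypothetical stand-in for the BLSTM: average the frame vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def qbe_search(query_frames, content_frames, window, shift,
               embed=mean_pool_embed):
    """Slide a fixed-length window over the content and return the
    (distance, start_frame) of the segment closest to the query."""
    query_vec = embed(query_frames)
    scored = []
    for start in range(0, len(content_frames) - window + 1, shift):
        segment = content_frames[start:start + window]
        scored.append((cosine_distance(query_vec, embed(segment)), start))
    return min(scored)


# Toy search content: four "background" frames followed by two frames
# that match the query.
content = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 2
query = [[1.0, 0.0], [1.0, 0.0]]
best_distance, best_start = qbe_search(query, content, window=2, shift=1)
print(best_start, best_distance)
```

Because every window has the same length, each segment maps to one vector and the search reduces to a series of cheap vector comparisons, which is the computational advantage over DTW noted in the abstract.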

Articles published on Acoustic Word Embeddings

Enhancing spoken term detection with deep acoustic word embeddings and cross-modal matching techniques

Leveraging Multilingual Transfer for Unsupervised Semantic Acoustic Word Embeddings

Acoustic Word Embeddings for End-to-End Speech Synthesis

Improved Acoustic Word Embeddings for Zero-Resource Languages Using Multilingual Transfer

Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks

Fast Query-by-example Speech Search using Attention-based Deep Binary Embeddings

Query-by-Example Speech Search Using Recurrent Neural Acoustic Word Embeddings With Temporal Context
