Untranscribed Speech Research Articles

Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs. We investigate to what extent keyword localisation is possible using a visually grounded speech (VGS) model. VGS models are trained on unlabelled images paired with spoken captions. These models are therefore self-supervised -- trained without any explicit textual label or location information. To obtain training targets, we first tag training images with soft text labels using a pretrained visual classifier with a fixed vocabulary. This enables a VGS model to predict the presence of a written keyword in an utterance, but not its location. We consider four ways to equip VGS models with localisations capabilities. Two of these -- a saliency approach and input masking -- can be applied to an arbitrary prediction model after training, while the other two -- attention and a score aggregation approach -- are incorporated directly into the structure of the model. Masked-based localisation gives some of the best reported localisation scores from a VGS model, with an accuracy of 57% when the system knows that a keyword occurs in an utterance and need to predict its location. In a setting where localisation is performed after detection, an $F_1$ of 25% is achieved, and in a setting where a keyword spotting ranking pass is first performed, we get a localisation P@10 of 32%. While these scores are modest compared to the idealised setting with unordered bag-of-word-supervision (from transcriptions), these models do not receive any textual or location supervision. Further analyses show that these models are limited by the first detection or ranking pass. Moreover, individual keyword localisation performance is correlated with the tagging performance from the visual classifier. We also show qualitatively how and where semantic mistakes occur, e.g. that the model locates surfer when queried with ocean.

We consider feature learning for a computationally efficient method of keyword spotting that can be applied in severely under-resourced settings. The objective is to support humanitarian relief programmes by the United Nations (UN) in parts of Africa in which almost no language resources are available. To allow a keyword spotting system to be rapidly developed in such a language, we rely on a small and easily-compiled set of isolated keywords. Using the isolated keywords as templates, we apply dynamic time warping (DTW) to a much larger corpus of in-domain but untranscribed speech. The resulting DTW alignment scores are used to train a convolutional neural network (CNN) which is orders of magnitude more computationally efficient than DTW and therefore suitable for real-time application. We optimise this ASR-free neural network keyword spotting procedure by identifying acoustic features that provide robust performance in this almost zero-resource setting. First, we consider the benefits of incorporating information from well-resourced but unrelated languages by incorporating a multilingual bottleneck feature (BNF) extractor. Next, we consider using features extracted from an autoencoder (AE) trained on in-domain but untranscribed data. Finally, we consider features obtained from a correspondence autoencoder (CAE) which is initialised with the AE and subsequently fine-tuned on the small set of in-domain labelled data. Experiments in South African English and Luganda, a low-resource language, demonstrate that, on their own, both the BNF and CAE features can achieve a 5% relative performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, resulting in a more than 27% relative improvement over MFCCs in ROC area-under-the-curve (AUC) and more than twice as many top-10 retrievals. We also show that, using these features, the CNN-DTW keyword spotter performs almost as well as the DTW keyword spotter while comfortably outperforming a baseline CNN trained only on the keyword templates. We conclude that a CNN-DTW keyword spotter using BNF-derived CAE features represents a computationally efficient approach with very competitive performance that is suited to rapid deployment in a severely under-resourced scenario.

Untranscribed Speech Research Articles

Articles published on Untranscribed Speech

Enhancing analysis of diadochokinetic speech using deep neural networks

Keyword Localisation in Untranscribed Speech Using Visually Grounded Speech Models

Feature learning for efficient ASR-free keyword spotting in low-resource languages

NAUTILUS: A Versatile Voice Cloning System

An Unsupervised Adaptation Approach to Leveraging Feedback Loop Data by Using i-Vector for Data Clustering and Selection

Multi-View and Multi-Objective Semi-Supervised Learning for HMM-Based Automatic Speech Recognition

Unsupervised training of acoustic models for large vocabulary continuous speech recognition

Learning visually grounded words and syntax of natural spoken language

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Untranscribed Speech Research Articles

Articles published on Untranscribed Speech

Enhancing analysis of diadochokinetic speech using deep neural networks

Keyword Localisation in Untranscribed Speech Using Visually Grounded Speech Models

Feature learning for efficient ASR-free keyword spotting in low-resource languages

NAUTILUS: A Versatile Voice Cloning System

An Unsupervised Adaptation Approach to Leveraging Feedback Loop Data by Using i-Vector for Data Clustering and Selection

Multi-View and Multi-Objective Semi-Supervised Learning for HMM-Based Automatic Speech Recognition

Unsupervised training of acoustic models for large vocabulary continuous speech recognition

Learning visually grounded words and syntax of natural spoken language