A segmental framework for fully-unsupervised large-vocabulary speech recognition

Herman Kamper, Aren Jansen, Sharon Goldwater

doi:10.1016/j.csl.2017.04.008


Open Access


Abstract

Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units—effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). Very high word error rates are reported—in the order of 70–80% for speaker-dependent and 80–95% for speaker-independent systems—highlighting the difficulty of this task. Nevertheless, in terms of cluster quality and word segmentation metrics, we show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system’s discovered clusters are still less pure than those of unsupervised term discovery systems, but provide far greater coverage.
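The abstract says each word segment is mapped from a variable-length sequence of feature frames to a single fixed-dimensional embedding, but it does not say which mapping is used. As a rough illustration only, the sketch below assumes uniform downsampling along the time axis with linear interpolation; the function name `downsample_embedding` and the parameter `n_samples` are hypothetical and do not come from the article.

```python
from typing import List, Sequence


def downsample_embedding(
    frames: Sequence[Sequence[float]], n_samples: int = 10
) -> List[float]:
    """Map a variable-length segment to one fixed-length vector.

    ASSUMPTION: uniform downsampling with linear interpolation in time.
    The abstract only states that frames are mapped to a single embedding;
    this particular mapping is chosen here purely for illustration.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    n_frames = len(frames)
    dim = len(frames[0])
    embedding: List[float] = []

    for k in range(n_samples):
        # Fractional frame index of the k-th sample point in [0, n_frames - 1].
        if n_samples > 1:
            pos = k * (n_frames - 1) / (n_samples - 1)
        else:
            pos = (n_frames - 1) / 2
        lo = int(pos)
        hi = min(lo + 1, n_frames - 1)
        w = pos - lo  # interpolation weight towards the later frame
        for d in range(dim):
            embedding.append((1 - w) * frames[lo][d] + w * frames[hi][d])

    # Output length is always n_samples * dim, regardless of segment duration.
    return embedding


# Two segments of different duration (3 and 7 frames, 2 features per frame).
short_segment = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
long_segment = [[float(t), float(2 * t)] for t in range(7)]

short_vec = downsample_embedding(short_segment, n_samples=3)
long_vec = downsample_embedding(long_segment, n_samples=3)
```

Because both segments end up as vectors of the same length, they can be compared or clustered directly, which is the property a segmental model of this kind relies on.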


Journal: Computer Speech & Language
Publication Date: May 18, 2017
Citations: 87
License type: other-oa


