Abstract

Unsupervised segmentation of speech into phone- and word-like units is typically treated as two separate tasks, often addressed by different methods that do not fully leverage the interdependence between them. Here, we unify the two tasks and propose a technique that performs both jointly, showing that they indeed benefit from each other. Recent attempts employ self-supervised learning, such as contrastive predictive coding (CPC), in which the next frame is predicted from past context. However, CPC captures only the frame-level structure of the audio signal. We overcome this limitation with a segmental contrastive predictive coding (SCPC) framework that models the signal structure at a higher level, e.g., the phone level. A convolutional neural network learns frame-level representations from the raw waveform via noise-contrastive estimation (NCE). A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder, also trained via NCE, to learn segment representations. The differentiable boundary detector allows us to train the frame-level and segment-level encoders jointly. Experiments show that our single model outperforms existing phone and word segmentation methods on the TIMIT and Buckeye datasets. We analyze the impact of the boundary detector's threshold on segmentation performance, and our results suggest that learning the threshold automatically can be as effective as tuning it manually. We find that phone class affects boundary detection performance, with boundaries between successive vowels or semivowels being the most difficult to detect. Finally, we use SCPC to extract speech features at the segment level rather than at uniformly spaced frames (e.g., every 10 ms), producing variable-rate representations that adapt to the content of the utterance. This lowers the feature extraction rate from the typical 100 Hz to as low as 14.5 Hz on average, while still outperforming hand-crafted features such as MFCCs on a linear phone classification task.
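To make the frame-level objective concrete, the following is a minimal PyTorch sketch of a CPC-style noise-contrastive next-frame loss. The encoder architecture, the plain dot-product similarity score, and the uniform within-utterance negative sampling are illustrative assumptions, not the exact SCPC configuration.

```python
# Minimal sketch of a CPC-style frame-level NCE objective (illustrative;
# layer sizes and the negative-sampling scheme are assumptions, not the
# authors' exact SCPC configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Strided 1-D convolutions mapping raw waveform to frame embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, wav):              # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))  # (batch, dim, frames)
        return z.transpose(1, 2)         # (batch, frames, dim)

def nce_next_frame_loss(z, num_negatives=10):
    """Each frame must identify its true successor among random negative
    frames drawn from the same utterance (collisions with the true target
    are not filtered here, for brevity)."""
    batch, frames, dim = z.shape
    context = z[:, :-1]                  # frames 0 .. T-2
    target = z[:, 1:]                    # true next frames 1 .. T-1
    neg_idx = torch.randint(frames, (batch, frames - 1, num_negatives))
    batch_idx = torch.arange(batch).view(batch, 1, 1)
    negatives = z[batch_idx, neg_idx]    # (batch, frames-1, K, dim)
    pos_score = (context * target).sum(-1, keepdim=True)    # (B, T-1, 1)
    neg_score = (context.unsqueeze(2) * negatives).sum(-1)  # (B, T-1, K)
    logits = torch.cat([pos_score, neg_score], dim=-1)
    labels = torch.zeros(batch, frames - 1, dtype=torch.long)  # 0 = positive
    return F.cross_entropy(logits.reshape(-1, num_negatives + 1),
                           labels.reshape(-1))

if __name__ == "__main__":
    encoder = FrameEncoder()
    wav = torch.randn(4, 16000)          # 1 s of 16 kHz audio per item
    loss = nce_next_frame_loss(encoder(wav))
    loss.backward()
    print(f"NCE loss: {loss.item():.3f}")
```

In SCPC, a differentiable boundary detector pools such frame embeddings into variable-length segments, and an analogous NCE objective is applied at the segment level, which is what allows the frame-level and segment-level encoders to be trained jointly.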
