Word-level invariant representations from acoustic waveforms

Stephen Voinea, Tomaso Poggio, Georgios Evangelopoulos, Chiyuan Zhang, Lorenzo Rosasco

doi:10.21437/interspeech.2014-518

Abstract

Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple levels (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, on par with the frame-level acoustic modeling usually applied in speech recognition systems. In this paper, we propose a framework for representing speech at the word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or preprocessing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.

Index Terms: invariance, acoustic features, speech representation, word classification
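The signature construction described in the abstract (project the waveform onto templates and their transformations, then histogram the projections) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the transformations are circular time shifts, the projection is a normalized dot product, the histogram uses fixed bins on [-1, 1], and the function names (`circular_shifts`, `signature`, and so on) are hypothetical. The paper targets speaker variability, which this toy example does not model.

```python
import math
import random


def circular_shifts(template):
    """All circular time shifts of a template (assumed transformation set)."""
    n = len(template)
    return [template[n - s:] + template[:n - s] for s in range(n)]


def normalized_dot(x, t):
    """Normalized dot product between a waveform and a transformed template."""
    dot = sum(a * b for a, b in zip(x, t))
    nx = math.sqrt(sum(a * a for a in x))
    nt = math.sqrt(sum(b * b for b in t))
    return dot / (nx * nt) if nx > 0 and nt > 0 else 0.0


def histogram(values, n_bins, lo=-1.0, hi=1.0):
    """Empirical distribution of one-dimensional projections over fixed bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp values at or past the edges
        counts[idx] += 1
    return [c / len(values) for c in counts]


def signature(x, templates, n_bins=8):
    """Concatenate one histogram per template, pooled over its transformations."""
    sig = []
    for t in templates:
        projections = [normalized_dot(x, g_t) for g_t in circular_shifts(t)]
        sig.extend(histogram(projections, n_bins))
    return sig


# Toy data: integer-valued random "waveforms" stand in for real audio.
rng = random.Random(0)
n = 64
templates = [[rng.randint(-100, 100) for _ in range(n)] for _ in range(4)]
x = [rng.randint(-100, 100) for _ in range(n)]
x_shifted = x[n - 10:] + x[:n - 10]  # same signal, circularly shifted by 10 samples

sig = signature(x, templates)
sig_shifted = signature(x_shifted, templates)
```

Because the assumed transformations form a group, shifting the input only permutes the set of projections, so `sig` and `sig_shifted` agree. Pooling over the transformations is what gives the signature its invariance in this sketch, while using several templates is what keeps it discriminative.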

Similar Papers

Phone classification by a hierarchy of invariant representation layers
Chiyuan Zhang ... Stephen Voinea
14 Sep 2014

Attention-based Wav2Text with feature transfer learning
Andros Tjandra ... Satoshi Nakamura
01 Dec 2017

Bat-inspired dynamic features and factors that modulate their impact on speech recognition
Alexander Hsu ... Xiaodong Cui
The Journal of the Acoustical Society of America | VOL. 144
01 Sep 2018

The representation of speech in a nonlinear auditory model: time-domain analysis of simulated auditory-nerve firing patterns
Guy J. Brown ... Nicholas R. Clark
27 Aug 2011