Abstract

Acoustic modelling without labelled audio is a difficult problem in speech processing, especially for under-resourced languages. Ideas from theories of speech production and perception can aid acoustic modelling in such a setting. Several production- and perception-related studies have shown the importance of the dynamic nature of speech. The present work attempts to discover and model the dynamic nature of the speech signal. Specifically, speech is modelled as a sequence of transient and steady-state units. Model initialisation, which is crucial for unsupervised acoustic modelling, is performed using the syllabic structure present in the speech signal. The proposed method has similarities with the distinctive region model (DRM) for speech production, where the dynamic regions are assumed to be contained within syllable-like segments. An analysis of the discovered units shows that they take transient and steady-state forms. The steady-state units predominantly correspond to vowels, while the transient units correspond to nasal, approximant, fricative, and stop transients. Finally, the effectiveness of the proposed method is explored by applying the acoustic units to zero-resource text-to-speech synthesis and unsupervised keyword spotting tasks.
