Abstract
Building speech synthesis systems for Indian languages is challenging because digital resources for these languages are scarce. Vocabulary-independent speech synthesis requires that a given text be split at the level of the smallest sound unit, namely, the phone. The waveforms or models of phones are concatenated to produce speech. When digital resources are scarce, the waveforms corresponding to the phones are obtained manually (by listening and marking). But manual labeling of speech data (also known as speech segmentation) can lead to inconsistencies, as the duration of a phone can be as short as 10 ms. The most common approach to automatic segmentation of speech is to perform forced alignment using monophone hidden Markov models (HMMs) obtained through embedded re-estimation after flat-start initialization. These alignments are then used in neural network frameworks to build better acoustic models for speech synthesis/recognition. Segmentation using this approach requires large amounts of data and does not work well for low-resource languages.

To address the paucity of data, signal processing cues, namely short-term energy (STE) and sub-band spectral flux (SBSF), are used in tandem with HMM-based forced alignment for automatic speech segmentation. STE and SBSF are computed on the speech waveforms. STE yields syllable boundaries, while SBSF provides locations of significant change in spectral flux that are indicative of fricatives, affricates, and nasals. STE and SBSF cannot be used directly to segment an utterance; minimum phase group delay based smoothing is performed to preserve these landmarks while at the same time reducing local fluctuations. The boundaries obtained with HMMs are corrected at the syllable level wherever the syllable boundaries are known to be correct. Embedded re-estimation of the monophone HMMs is then performed using the corrected alignments.
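The two signal processing cues mentioned above can be sketched as follows. This is a minimal illustration, not the thesis implementation: frame size, hop, FFT size, and the sub-band edges are illustrative choices, and the minimum phase group delay smoothing step is omitted.

```python
import numpy as np

def short_term_energy(x, frame_len=400, hop=160):
    """Short-term energy (STE): sum of squared samples per frame.
    Valleys in a smoothed STE contour suggest syllable boundaries."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def sub_band_spectral_flux(x, frame_len=400, hop=160, n_fft=512, band=(32, 128)):
    """Sub-band spectral flux (SBSF): frame-to-frame change of the
    magnitude spectrum restricted to a band (bin range is illustrative).
    Large values flag abrupt spectral changes, e.g. at fricatives."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    mags = np.array([np.abs(np.fft.rfft(x[i * hop : i * hop + frame_len] * window, n_fft))
                     for i in range(n_frames)])
    sub = mags[:, band[0]:band[1]]
    flux = np.sum(np.diff(sub, axis=0) ** 2, axis=1)
    return np.concatenate([[0.0], flux])  # pad so length matches STE
```

In practice both contours fluctuate locally, which is why the abstract applies minimum phase group delay based smoothing before reading off landmarks.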
Thus, using signal processing cues and HMM re-estimation in tandem, robust monophone HMM models are built. These models are then used in Gaussian mixture model (GMM), deep neural network (DNN), and convolutional neural network (CNN) frameworks to obtain state-level frame posteriors. The boundaries are again iteratively corrected and re-estimated. Text-to-speech (TTS) systems are built for different Indian languages using phone alignments obtained with and without signal processing based boundary corrections. Both unit selection based and statistical parametric TTS systems are built. Listening tests show a significant improvement in the quality of synthesis with the use of signal processing based boundary correction.
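The boundary-correction step described above can be sketched in its simplest form: snap an HMM-derived boundary to the nearest signal-processing landmark, but only when a landmark lies within a small search window. The function name and the window size are illustrative assumptions, not the thesis's exact procedure.

```python
def correct_boundary(hmm_boundary, landmarks, window=0.05):
    """Return the landmark (in seconds) closest to hmm_boundary if one
    lies within +/- window seconds; otherwise keep the HMM boundary.
    landmarks: candidate boundary times from STE/SBSF cues."""
    candidates = [t for t in landmarks if abs(t - hmm_boundary) <= window]
    if not candidates:
        return hmm_boundary  # no trustworthy cue nearby; leave unchanged
    return min(candidates, key=lambda t: abs(t - hmm_boundary))
```

After such corrections, embedded re-estimation of the HMMs is run again, so each iteration trains on progressively cleaner alignments.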