HMM-Based Speech Segmentation: Improvements of Fully Automatic Approaches

Sandrine Brognaux, Thomas Drugman

doi:10.1109/taslp.2015.2456421

Abstract

Speech segmentation refers to the problem of determining the phoneme boundaries from an acoustic recording of an utterance together with its orthographic transcription. This paper focuses on a particular case of hidden Markov model (HMM)-based forced alignment in which the models are directly trained on the corpus to align. The obvious advantage of this technique is that it is applicable to any language or speaking style and does not require manually aligned data. Through a systematic step-by-step study, the role played by various training parameters (e.g., model configuration, number of training iterations) in the alignment accuracy is assessed, with corpora varying in speaking style and language. Based on a detailed analysis of the errors commonly made by this technique, we also investigate the use of additional fully automatic strategies to improve the alignment. Besides the use of supplementary acoustic features, we explore two novel approaches: an initialization of the silence models based on a voice activity detection (VAD) algorithm and the use of forced alignment on the time-reversed sound. The evaluation is carried out on 12 corpora of different sizes, languages (some being under-resourced) and speaking styles. It aims at providing a comprehensive study of the alignment accuracy achieved by the different versions of the speech segmentation algorithm depending on corpus-related specificities. While the baseline method is shown to reach good alignment rates with corpora as small as 2 minutes, we also emphasize the benefit of using a few seconds of bootstrapping data. Regarding improvement methods, our results show that the addition of supplementary features outperforms the other two strategies. The benefit of VAD, however, is particularly striking on very small corpora, where it corrects more than 60% of the errors larger than 40 ms. Finally, the combination of the three improvement methods provides the highest alignment rates, with very low variability across corpora, regardless of their size. This combined technique is shown to outperform available speaker-independent models, improving the alignment rate by 8 to 10% absolute.
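
As a rough illustration of the quantities mentioned in the abstract, the sketch below computes an alignment rate as the share of predicted phoneme boundaries that fall within a tolerance of manually placed reference boundaries, and maps boundaries obtained on a time-reversed recording back onto the original time axis. The function names, the millisecond unit and the example values are assumptions made for illustration only; they are not taken from the paper, which may define its evaluation and its use of the reversed alignment differently.

```python
from typing import List, Sequence


def alignment_rate(reference_ms: Sequence[float],
                   predicted_ms: Sequence[float],
                   tolerance_ms: float = 40.0) -> float:
    """Share of predicted boundaries within tolerance_ms of the reference.

    Assumes both sequences list the same boundaries in the same order,
    with times expressed in milliseconds (hypothetical convention).
    """
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("reference and prediction must have the same length")
    if not reference_ms:
        raise ValueError("at least one boundary is required")
    within = sum(
        1 for ref, pred in zip(reference_ms, predicted_ms)
        if abs(ref - pred) <= tolerance_ms
    )
    return within / len(reference_ms)


def unreverse_boundaries(reversed_ms: Sequence[float],
                         duration_ms: float) -> List[float]:
    """Map boundaries found on time-reversed audio back to forward time.

    A boundary at time t in the reversed signal lies at duration - t
    in the original signal; sorting restores chronological order.
    """
    return sorted(duration_ms - t for t in reversed_ms)


# Made-up example values, not data from the paper.
reference = [120, 310, 505, 760]
forward = [125, 300, 560, 765]           # third boundary is off by 55 ms
backward = unreverse_boundaries([240, 495, 690, 880], duration_ms=1000)

rate_forward = alignment_rate(reference, forward)
rate_backward = alignment_rate(reference, backward)
```

With these made-up values, three of the four forward boundaries lie within 40 ms of the reference, while the boundaries recovered from the reversed signal coincide with it. How the forward and reversed alignments are combined in practice is left open here.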

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2016
Citations: 71

Similar Papers

Speaker Identification Using Empirical Mode Decomposition-Based Voice Activity Detection Algorithm under Realistic Conditions
M.S Rudramurthy ... V Kamakshi Prasad
Journal of Intelligent Systems | VOL. 23
02 Apr 2014

An improved noise-robust voice activity detector based on hidden semi-Markov models
Yuan Liang ... Baosong Shan
Pattern Recognition Letters | VOL. 32
21 Feb 2011

Voice Activity Detector for Noise Spectrum Estimation Using a Dynamic Band-Splitting Entropy Estimate
Kun-Ching Wang
International Journal of Computers and Applications | VOL. 33
01 Jan 2010

Voice activity detection based on support vector machine using effective feature vectors
Q-Haing Jo ... Kye-Hwan Lee
27 Aug 2007
