Abstract
Speech segmentation refers to the problem of determining the phoneme boundaries from an acoustic recording of an utterance together with its orthographic transcription. This paper focuses on a particular case of hidden Markov model (HMM)-based forced alignment in which the models are directly trained on the corpus to be aligned. The obvious advantage of this technique is that it is applicable to any language or speaking style and does not require manually aligned data. Through a systematic step-by-step study, we assess the role played by various training parameters (e.g., model configuration, number of training iterations) in the alignment accuracy, using corpora that vary in speaking style and language. Based on a detailed analysis of the errors commonly made by this technique, we also investigate the use of additional fully automatic strategies to improve the alignment. Besides the use of supplementary acoustic features, we explore two novel approaches: an initialization of the silence models based on a voice activity detection (VAD) algorithm, and the use of the forced alignment of the time-reversed sound. The evaluation is carried out on 12 corpora of different sizes, languages (some being under-resourced) and speaking styles. It aims to provide a comprehensive study of the alignment accuracy achieved by the different versions of the speech segmentation algorithm depending on corpus-related specificities. While the baseline method is shown to reach good alignment rates with corpora as small as 2 minutes, we also emphasize the benefit of using a few seconds of bootstrapping data. Regarding improvement methods, our results show that the insertion of additional features outperforms the two other strategies. The performance of VAD, however, is shown to be particularly striking on very small corpora, correcting more than 60% of the errors larger than 40 ms.
Finally, the combination of the three improvement methods is shown to provide the highest alignment rates, with very low variability across the corpora, regardless of their size. This combined technique outperforms available speaker-independent models, improving the alignment rate by 8 to 10% absolute.
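To make the reported figures concrete, the following is a minimal sketch of how an alignment rate and a large-error correction rate could be computed from boundary times. It is an illustration under stated assumptions, not the evaluation code of the paper: the function names, the 20 ms default tolerance, and the example boundary values are all hypothetical, and boundaries are assumed to be paired one-to-one with the manual reference. Times are given in milliseconds.

```python
from typing import Sequence


def alignment_rate(reference: Sequence[float],
                   predicted: Sequence[float],
                   tolerance_ms: float = 20.0) -> float:
    """Fraction of predicted boundaries lying within tolerance_ms of the
    corresponding reference boundary (20 ms default is an assumption)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("boundary lists must be non-empty and of equal length")
    hits = sum(1 for r, p in zip(reference, predicted)
               if abs(r - p) <= tolerance_ms)
    return hits / len(reference)


def large_error_correction(reference: Sequence[float],
                           baseline: Sequence[float],
                           improved: Sequence[float],
                           threshold_ms: float = 40.0) -> float:
    """Among baseline boundaries whose error exceeds threshold_ms, the
    fraction that the improved alignment brings within threshold_ms."""
    if not (len(reference) == len(baseline) == len(improved)):
        raise ValueError("boundary lists must have equal length")
    large = [i for i, (r, b) in enumerate(zip(reference, baseline))
             if abs(r - b) > threshold_ms]
    if not large:
        raise ValueError("baseline has no error larger than the threshold")
    fixed = sum(1 for i in large
                if abs(reference[i] - improved[i]) <= threshold_ms)
    return fixed / len(large)


# Hypothetical boundary times in milliseconds (not data from the paper).
reference = [100, 250, 400, 620, 800]
baseline = [105, 300, 395, 700, 790]   # absolute errors: 5, 50, 5, 80, 10
improved = [102, 260, 398, 690, 805]   # absolute errors: 2, 10, 2, 70, 5

baseline_rate = alignment_rate(reference, baseline)    # 3 of 5 within 20 ms
improved_rate = alignment_rate(reference, improved)    # 4 of 5 within 20 ms
corrected = large_error_correction(reference, baseline, improved)  # 1 of 2
```

In this toy example the alignment rate rises from 0.6 to 0.8, a gain of 20% absolute, and half of the baseline errors larger than 40 ms are corrected. The "8 to 10% absolute" improvement in the abstract is a difference of this kind between two alignment rates.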
More From: IEEE/ACM Transactions on Audio, Speech, and Language Processing