Developing STT and KWS systems using limited language resources

Viet-Bac Le, Lori Lamel, William Hartmann, Jean-Luc Gauvain, Julien Despres, Anindya Roy, Abdel Messaoudi, Cécile Woehrling

doi:10.21437/interspeech.2014-527

Abstract

This paper presents recent progress in developing speech-to-text (STT) and keyword spotting (KWS) systems for the 2014 IARPA-Babel evaluation. Systems have been developed for the limited language pack condition for four of the five development languages in this program phase: Assamese, Bengali, Haitian Creole and Zulu. The systems have several novel characteristics that support rapid development of KWS systems. On the STT side, different acoustic units based on phonemic or graphemic representations are explored, and system combination is used to improve STT performance. The acoustic models are trained on only 10 hours of manually transcribed speech data, complemented by unsupervised training on additional untranscribed data. Both word and sub-word units (morphologically decomposed, syllables, phonemes) are used for KWS. The KWS systems search either the multiple hypotheses produced by consensus network decoding or word lattices. The word error rates of the individual STT systems are on the order of 50-60%, and the KWS systems obtain Maximum Term Weighted Values ranging from 30% to 45% for all keywords (in-vocabulary and out-of-vocabulary (OOV)). Sub-word units are shown to be successful at locating some of the OOV keywords, and system combination improves system performance.

Index Terms: STT, KWS, semi-supervised training, lattice, consensus network, sub-word lexical units, Morfessor
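The abstract reports KWS results as Maximum Term Weighted Value (MTWV) without defining it. As a reading aid, the following is a minimal sketch of how TWV and MTWV are commonly computed in the NIST keyword search evaluations, not the authors' scoring code. The function names, the input layout, and the toy detections are hypothetical; the weight 999.9 is the value conventionally used in those evaluations and is assumed here rather than taken from the abstract.

```python
from collections import defaultdict

BETA = 999.9  # assumed: conventional NIST KWS weight on false alarms


def twv(hits, n_true, speech_seconds, threshold, beta=BETA):
    """Term Weighted Value at one decision threshold.

    hits: iterable of (keyword, score, is_correct) detections (hypothetical layout).
    n_true: dict mapping keyword -> number of reference occurrences.
    speech_seconds: duration of the searched audio, in seconds.
    """
    correct = defaultdict(int)
    false_alarms = defaultdict(int)
    for keyword, score, is_correct in hits:
        if score >= threshold:
            if is_correct:
                correct[keyword] += 1
            else:
                false_alarms[keyword] += 1

    # Only keywords that occur in the reference contribute to the average.
    scored = [kw for kw, n in n_true.items() if n > 0]
    penalty = 0.0
    for kw in scored:
        p_miss = 1.0 - correct[kw] / n_true[kw]
        # Non-target trials are approximated as one per second of speech.
        p_fa = false_alarms[kw] / (speech_seconds - n_true[kw])
        penalty += p_miss + beta * p_fa
    return 1.0 - penalty / len(scored)


def mtwv(hits, n_true, speech_seconds):
    """Maximum TWV over all candidate thresholds (including 'accept nothing')."""
    thresholds = {score for _, score, _ in hits} | {float("inf")}
    return max(twv(hits, n_true, speech_seconds, t) for t in thresholds)


# Toy data, invented for illustration only.
toy_hits = [("a", 0.9, True), ("a", 0.4, False), ("b", 0.8, True), ("b", 0.3, True)]
toy_refs = {"a": 1, "b": 2}
toy_duration = 36000.0  # 10 hours of audio, in seconds

print(round(twv(toy_hits, toy_refs, toy_duration, 0.8), 4))
print(round(mtwv(toy_hits, toy_refs, toy_duration), 4))
```

Because a single false alarm is weighted by roughly 1000 divided by the audio duration in seconds, MTWV rewards systems that find rare keywords while keeping false alarms low, which is why OOV handling matters for the reported scores.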

Full Text