Active learning based data selection for limited resource STT and KWS

Thiago Fraga-Silva, Lori Lamel, Antoine Laurent, Abdel Messaoudi, Jean-Luc Gauvain, Viet-Bac Le

doi:10.21437/interspeech.2015-636

Abstract

This paper presents first results on using active learning (AL) for training data selection in the context of the IARPA Babel program. Given an initial training data set, we aim to automatically select additional data (from a pool of untranscribed data) for manual transcription. The initial and selected data are then used to build acoustic and language models for speech recognition. The goal of the AL task is to outperform a baseline system built using a pre-defined data selection with the same amount of data, the Very Limited Language Pack (VLLP) condition. AL methods based on different selection criteria have been explored. Compared to the VLLP baseline, improvements are obtained in terms of Word Error Rate (WER) and Actual Term Weighted Value (ATWV) for the Lithuanian language. A description of the methods and an analysis of the results are given. The AL selection also outperforms the VLLP baseline for other IARPA Babel languages, and will be further tested in the upcoming NIST OpenKWS 2015 evaluation.

Index Terms: active learning, low-resourced STT, KWS.

Full Text